Transcript for Episode #167:
Cheating at Kaggle and uWSGI in prod

Recorded on Wednesday, Jan 29, 2020.

00:00 KENNEDY: Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is Episode 167, recorded January 29th 2020. I'm Michael Kennedy, and Brian Okken is away. We miss you Brian, but I have a very special guest to join me. Vicki Boykis, welcome Vicki.

00:18 BOYKIS: Thanks for having me.

00:19 KENNEDY: Yeah, it's great to have you here. I'm excited to get your take on this week's Python news. I know you found some really interesting and controversial ones that we're going to jump into, and that will be great. Also great is Datadog, they are sponsoring this episode so check them out at pythonbytes.fm/datadog. I'll tell you more about them later. Vicki, let me kick us off with command line interface libraries for lack of a better word. So, back on Episode 164... so three episodes ago, I talked about this thing called Typer. Have you heard of Typer?

00:51 BOYKIS: I have not, but I have heard of Click so I'm curious to see how this differs from that, even.

00:56 KENNEDY: Yeah, so this is sort of a competitor to Click. Typer is super cool because what it does is it uses native Python concepts to build out your CLI rather than attributes where you describe everything. So, for example, you can have a function, and you just say this function takes a name: str, to give it a type, or an int or whatever. And then Typer can automatically use the type and the name of the parameters and stuff to generate your help and the inbound arguments and so on. So that's pretty cool, right?
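
The idea of deriving a command line interface from a function's signature can be sketched with only the standard library. This is not Typer's API or its implementation: `build_parser`, `run`, and `greet` are hypothetical names, and the sketch assumes every parameter is annotated with a simple callable type such as `str` or `int`.

```python
import argparse
import inspect


def build_parser(func):
    """Build an argparse parser from a function's annotated signature."""
    parser = argparse.ArgumentParser(description=func.__doc__)
    for name, param in inspect.signature(func).parameters.items():
        # Fall back to str when a parameter carries no annotation.
        arg_type = str if param.annotation is inspect.Parameter.empty else param.annotation
        if param.default is inspect.Parameter.empty:
            # No default: treat it as a required positional argument.
            parser.add_argument(name, type=arg_type)
        else:
            # Has a default: expose it as an optional --flag.
            parser.add_argument(f"--{name.replace('_', '-')}", type=arg_type, default=param.default)
    return parser


def run(func, argv):
    """Parse argv against the function's signature, then call the function."""
    args = build_parser(func).parse_args(argv)
    return func(**vars(args))


def greet(name: str, count: int = 1):
    """Greet someone a number of times."""
    return [f"Hello {name}" for _ in range(count)]


print(run(greet, ["Vicki", "--count", "2"]))
```

The type annotation does double duty here: it documents the function and it tells the parser how to convert the incoming string.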

01:31 BOYKIS: Yeah, seems like a great excuse to start using type annotations if you haven't yet.

01:34 KENNEDY: Yeah, exactly, very nice that it leverages type annotations, hence the name Typer. So, our listeners are great, they always send in stuff that we haven't heard about or "I can't believe you didn't talk about this other thing!" So, Marcello sent in a message and says, "Hey you should talk about Clize, which turns functions into command line interfaces." So, Clize is really cool and it's really similar to Typer in how it works. So what you do is you create functions, you give them parameters. You don't have to use types in the sense that Typer does, but you have positional arguments and you have keyword-only arguments. And, you know, Python has that syntax that very few people use but is cool if you want to enforce it. Where you can say here are some parameters to a function, *, here are some more, and the stuff after the star has to be addressed as a keyword argument, right?

02:30 BOYKIS: Mhmm

02:30 KENNEDY: So it leverages that kind of stuff. So you can say, like their example says, "Here's a hello world function, and it takes a name, which has a default of None, and then *, no_capitalize is False." And it gives it a default value. So all you've got to do to run it is basically import run from clize and call run on your function. And then what it does is it verifies those arguments, whether or not they're required, and then it'll convert the keyword arguments to --this or --that. So, like, --no-capitalize will pass True to no_capitalize. If you omit it, it'll pass whatever the default is I guess, so False. So there's positional ones where you don't say the name but then also this cool way of adding on these --capitalize and so on. So, it seems like a really cool and pretty extensive library for building command line interfaces.
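
The keyword-only syntax described above can be shown in plain Python. This is a sketch modeled on the hello world function as it is described in the conversation; the function body is an assumption, and Clize itself is left out so the example runs without any third-party install.

```python
def hello_world(name=None, *, no_capitalize=False):
    """Greet the world or the given name.

    name: if given, greet only this person.
    no_capitalize: leave the name exactly as typed.
    """
    if name:
        if not no_capitalize:
            name = name.title()
        return f"Hello {name}!"
    return "Hello world!"


print(hello_world("vicki"))                      # positional argument
print(hello_world("vicki", no_capitalize=True))  # keyword-only argument

# Everything after the bare * must be passed by keyword.
try:
    hello_world("vicki", True)
except TypeError as error:
    print("rejected:", error)
```

With Clize installed, the step described in the conversation is to import `run` from `clize` and call `run(hello_world)`, at which point `--no-capitalize` on the command line maps to `no_capitalize=True`.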

03:21 BOYKIS: Yeah, so this seems like it'd be good if you have a lot of parameters that you have to pass in. I'm thinking specifically of some of the work that you would do in the cloud, like any AWS command line.

03:31 KENNEDY: Yeah for sure.

03:32 BOYKIS: Or similar.

03:33 KENNEDY: Another thing that's cool is it will take your docstrings and use those as help messages.

03:38 BOYKIS: Oh, that's neat.

03:39 KENNEDY: Yeah, so you know in some editors you can type triple quote enter and it'll generate, you know, here's the summary of the method and then here's the arguments and you can just put or you can write them out, of course. And then here's the descriptions about each parameter. Those become help messages about each command in there. It's really nice and I like how it just uses pure Python, sort of similar to Typer in that regard that you don't put forty levels of decorators on top of things and then reference other parts of that. You just say, here's some Python code. I want to treat it as a command line interface, clize.run.

04:15 BOYKIS: That is pretty cool. So, there's now a lot of choices if you want to do command line interfaces.

04:20 KENNEDY: Yeah, definitely, and Click is good, and it's very popular, and argparse as well, but I'm kind of a fan of the pure Python ones that don't require me to go do a whole bunch of extra stuff. So, yeah, definitely loving that. You know what, I bet that Kaggle is not loving what you're talking about next.

04:38 BOYKIS: Well I think they might be but we'll see.

04:40 KENNEDY: Yeah, we'll see. Tell us about Kaggle and then what the big news here is.

04:44 BOYKIS: Yeah, so there was a dust-up at Kaggle a couple of weeks ago, so just as a little bit of background, Kaggle is a platform that is now owned by Google that allows data scientists to find datasets to learn data science and most importantly it's probably known for letting people participate in machine learning competitions. That's kind of how it gained its popularity and notoriety.

05:04 KENNEDY: That's how I know it.

05:05 BOYKIS: Yep, and so people can sharpen their data science and modeling skills on it. They recently, I want to say last fall, hosted a competition that was about analyzing pet shelter data. And this resulted in enormous controversy. What happened is there's this website that's called petfinder.my, that helps people find pets to rescue in Malaysia from shelters. In 2019 they announced a collaboration with Kaggle to create a machine learning algorithm to predict which pets would be most likely to be adopted based on the metadata and descriptions on the site. So, if you go to petfinder.my you'd see that they'd have a picture of the pet, a description, how old they are, and some other attributes about them.

05:49 KENNEDY: Right, were they vaccinated? Or things like that, right? You might think, well, if they're vaccinated or they're neutered or spayed they may be more likely to be adopted, but you don't necessarily know. So figuring out which factors are important was what this whole competition was about, right?

06:06 BOYKIS: Yeah, the goal was to help the shelters write better descriptions so the pets would be adopted more quickly. They held this competition for several months, and there was a contestant who won. He was previously called a Kaggle grandmaster. So, he won a lot of different stuff on Kaggle before and he won ten thousand dollars in prize money, but then what happened is they started to validate all of his data. Because when you do a Kaggle competition you then submit all of your data and all of your results, and your notebooks, and your code.

06:37 KENNEDY: Like how you train your models and stuff like that, right?

06:41 BOYKIS: Yeah, all of that stuff. And then what happened was PetFinder wanted to put this model into production, so you initially have something like a Jupyter or a Colab notebook, in this case, and the idea is that now you would be able to integrate it into the PetFinder website. So they can actually use these predictors to fine-tune how they post the content. So, when a volunteer, who is Benjamin Minixhofer, offered to put the algorithm into production and he started to look at it, he found that there was a huge discrepancy between the first and second place entries in the contest. So what happened was, to get a little more into the technical aspect, the data they gave to contestants asked them to predict the speed at which a pet would be adopted, from one to five, and included some of the features you talked about like animal breed, coloration, all that stuff. The initial training set had fifteen thousand animals and then after a couple of months the contestants were given four thousand animals that had not been seen before as a test of how accurate they were. What the winner did was he actually scraped basically most of the website so that he got that four thousand set, the validation set, also, and he had the validation set in his notebook. Basically what he did was he used the md5 library to create a hash for each unique pet. And then he looked up the adoption score for each of those pets, basically when they were adopted, from that external dataset, and there were about thirty-five hundred that had overlaps with the validation set. Then he did column manipulation in pandas to get at the hidden prediction variable for every tenth pet. Not every single pet but every tenth pet so it didn't look too obvious.

08:26 KENNEDY: Right, so he gave himself a ten percent head start or advantage or something like that.

08:30 BOYKIS: Exactly. He replaced the prediction that should have been generated by the algorithm with the actual value. Then he did a dictionary lookup between the initial md5 hash and the value of the hash. And this was all obfuscated in a separate function that happened in his data.
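
The mechanism described above can be reconstructed as a toy example, which also shows what a reviewer would be looking for: a hash-keyed dictionary that has nothing to do with modeling. This is not the contestant's code. The pet records, the field layout, and the names `pet_key`, `leaked_answers`, and `blend_predictions` are all hypothetical.

```python
import hashlib


def pet_key(name, age_months, breed):
    """Hash a pet's public attributes into a stable lookup key."""
    raw = f"{name}|{age_months}|{breed}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


# Hypothetical scraped records: (name, age in months, breed, adoption speed).
scraped = [("Milo", 3, "tabby", 1), ("Luna", 24, "mixed", 4)]

# The leak: true answers keyed by hash, so no pet names appear in the notebook.
leaked_answers = {pet_key(n, a, b): speed for n, a, b, speed in scraped}


def blend_predictions(test_pets, model_predictions, every=10):
    """Overwrite every Nth model prediction with a leaked answer, if one exists."""
    blended = list(model_predictions)
    for index, pet in enumerate(test_pets):
        if index % every != 0:
            continue
        answer = leaked_answers.get(pet_key(*pet))
        if answer is not None:
            blended[index] = answer
    return blended


test_pets = [("Milo", 3, "tabby")] + [("Luna", 24, "mixed")] * 10
print(blend_predictions(test_pets, [2] * 11))
```

Only positions 0 and 10 are overwritten, which is why the manipulation is hard to spot from the scores alone and easy to spot once someone reads the code.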

08:47 KENNEDY: Wow, and so they must have been looking at this going, "What does the md5 hash of the pet attributes have to do with anything?" You know what I mean right? The hashes are meant to obscure stuff, right?

08:59 BOYKIS: Right, yeah.

09:00 KENNEDY: What was the fallout?

09:02 BOYKIS: The fallout was this guy worked at H2O.ai and so he was fired from there. Kaggle also issued an apology where they explained exactly what happened, and they expressed the hope that this didn't mean that every contest going forward would be viewed with suspicion, and that there would be more openness and collaboration going forward.

09:21 KENNEDY: Wow.

09:22 BOYKIS: It was an amazing catch by them though.

09:23 KENNEDY: Yeah, that's such a good catch. I'm so glad that Benjamin did that and caught the whole deal here. Now, did Kaggle actually end up paying him the ten thousand before they caught it? Is there some sort of waiting period?

09:35 BOYKIS: Unfortunately, I think the money had already been disbursed by that point.

09:40 KENNEDY: I can easily see something, well, you know, like, the prize money will be sent out after. It may change the timing of that for sure in the future. Who knows? Wow, that's crazy. Do you know why he was fired? I mean they're just like "We don't want you." I mean H2O.ai, they're kind of a, "We'll help you with your AI story." So, I guess you know they're probably like, "We don't want any of the negativity of that on our product."

10:11 BOYKIS: I think that's essentially it, and it was a pretty big competition in the data science community, and I think also once they had already started to look into it, in other places previously he talked about basically scraping data to game competitions as well. So, all of that stuff started to come out as well.

10:27 KENNEDY: Wow.

10:28 BOYKIS: I think they wanted to distance themselves.

10:30 KENNEDY: Yeah, I can imagine. Yikes. Okay, well thank you for sharing that. Now, before we get to the next one let me tell you about this week's sponsor, Datadog. They're a cloud-scale monitoring platform that unifies metrics, logs, and traces. Monitor your Python applications in real time, find bottlenecks with detailed flame graphs, trace requests as they travel across service boundaries, and their tracing client auto-instruments popular frameworks like Django, asyncio, and Flask, so you can quickly get started monitoring the health and performance of your Python apps. Do that with a fourteen-day free trial and Datadog will send you a complimentary t-shirt. Cool little Datadog t-shirt. So, check them out at pythonbytes.fm/datadog. This next one kind of hits home for me because I have a ton of services and a lot of servers and websites and all these things working together, running on uWSGI, U-W-S-G-I, and I've had it running for quite a few years, and it's got a lot of traffic. You know, we do, I dunno, fourteen terabytes of traffic a month or maybe even more than that. So quite a bit of traffic going around these services and whatnot. So, it's been working fine, but I ran across this article by the engineers at Bloomberg called "Configuring uWSGI for Production Deployment", and I actually learned a lot from this article. So, I don't feel like I was doing too many things wrong, but there were a couple of things where I'm like, "Oh, yeah, I should probably do that." And other stuff that is just really nice. So, I just want to run you through a couple of things that I learned, and if you want to hear more about how we're using uWSGI you can check that out on Talk Python 215. Dan Bader and I swapped stories about how we're running our various things. You know, Talk Python Training and realpython.com and whatnot. So this is guidance from Bloomberg's Engineering Structured Products Application Group. Whew, that's quite the title.
And they decided to use uWSGI because it's really good for performance and easy to work with. However, they said that as uWSGI has matured, some of the defaults that made sense when it was new, like in 2008, don't make sense anymore. The reason is partly just because the way people use these sites is different, or these servers is different. For example, putting proxies up in front of uWSGI with, say, Nginx, that used to not be so popular. So, they made these defaults built into the system that maybe don't make sense anymore. So what they did is they said, "We're going to go through and talk about all the things that we're going to override the defaults for and why." Unbit, the developer of uWSGI, is going to fix all of these bad defaults in the 2.1 release, but right now it's 2.0 as of this recording, so you're going to have to just, you know, hang in there or apply some of these changes. Now I do want to point out one thing. When I switched on a lot of these I did it one at a time, and the way you get it to reload its config is you restart the process with systemctl, the daemon management tool on Linux. One of their recommendations is to use this flag, "die-on-term", which is for it to die on a different signal that it receives. And for whatever reason, maybe I'm doing it wrong, but whenever I turned that on it would just lock up and it would take about two minutes to restart the server, because it would just hang until it eventually timed out and was forcibly killed. That seems bad so I'm not using that, but I'll go quickly over the settings that I use that I thought were cool, here. You've got these complicated config files, and if you want to make sure everything is validated you can say strict equals true. That's cool, that will verify that everything that's typed in the file is accurate and is valid, because that's kind of forgiving at the moment.
Master is true is a good setting because this allows it to create worker processes and recycle them based on number of requests and so on. Something that's interesting, which I didn't even realize you could do, and tell me if you knew this was possible in Python apps: you can disable the GIL, the global interpreter lock. You can say, "You know what? For this Python interpreter let's not have a GIL."

14:36 BOYKIS: Wow, how does that work?

14:37 KENNEDY: Well, it's, I mean people talk about having no GIL as "Oh! You can do all this cool concurrency and whatnot." But what it really means is you're basically guaranteeing you can only have one thread. So, if you try to launch, let's say, a background job on a uWSGI server and you don't pass enable-threads is true, it's just going to not run. Because there's no GIL and there's no way to start it. So, that's something you want to have on. Vacuum equals true: this one I had off and I turned it on, and apparently this cleans up temporary files and so on. Also, single interpreter: it used to be that uWSGI was more of an app server that might have different versions of Python and maybe Ruby as well, and this will just say, "No, no. This is just the one version." A couple of other ones: you can specify the name that shows up in top or glances. You give it your website name, and it'll say things like, that thing worker process one or that thing master process or whatnot. And so there's just a bunch of cool things in here with nice descriptions of why you want these features. So, if you are out there and you're running uWSGI give this a quick scan, it's really cool. Now, this next one is also pretty neat. This one comes from the people who did spaCy, right? What have they got going on?
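
The settings mentioned in this segment could be collected into an ini file along the following lines. This is a sketch, not the Bloomberg article's recommended file: the process-name options (`auto-procname`, `procname-prefix`) are an assumption about which options were meant, and `mysite-` is a placeholder. The file is held in a Python string so the standard library's `configparser` can confirm it parses.

```python
import configparser

UWSGI_INI = """\
[uwsgi]
# Fail on unknown options instead of silently ignoring typos.
strict = true
# Run a master process that manages and recycles workers.
master = true
# Allow the application to start threads (needed for background jobs).
enable-threads = true
# Clean up temporary files and sockets on exit.
vacuum = true
# Serve one application with one interpreter per worker.
single-interpreter = true
# Readable process names in top (assumed options, placeholder prefix).
auto-procname = true
procname-prefix = mysite-
"""

parser = configparser.ConfigParser()
parser.read_string(UWSGI_INI)
settings = dict(parser["uwsgi"])

for option, value in settings.items():
    print(f"{option} = {value}")
```

`die-on-term` is left out on purpose, matching the experience described above where enabling it made restarts hang.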

15:47 BOYKIS: That's right, so this was just released a couple of days ago. And it's called Thinc, and they built it as a functional take on deep learning, and so basically, if you're familiar with deep learning, there's kind of two big competing frameworks right now, TensorFlow and PyTorch, and MXNet is also in there. So, the idea of this library is that it abstracts away some of the boilerplate that you have to write for both TensorFlow and PyTorch. PyTorch has a little bit less. But you end up writing a lot of the same kind of stuff, and there's also some stuff that's obfuscated away from you, specifically some of the matrix operations that go on under the hood. And so Thinc already powers spaCy, which is an NLP library, under the covers. So what the team did was they surfaced it so that other people could use it more generically in their projects, and so it has that favorite thing that we love, it has type checking, which is particularly helpful for tensors when you're trying to get stuff and you're not sure why it's not returning things. It has classes for PyTorch wrappers and for TensorFlow and you can intermingle the two if you want to. If you have two libraries that bridge the things. It has deep support for NumPy structures, which are kind of the underlying structures for deep learning. It operates in batches, which is also a common feature of deep learning projects. So they process features and data in batches. And then also, sometimes a problem that you have with deep learning is you are constantly tuning hyperparameters, or the variables that you put into your model to figure out how long you're going to run it for, what size your images are going to be, and usually those are clustered at the beginning of your files like a dump or a dictionary or whatever. It has a special structure to handle those as well.
So it basically hopes to make it easier and more flexible to do deep learning, especially if you're working with two different libraries, and it offers a nice higher level of abstraction on top of that. And the other cool thing is that they have already released all the examples and code, which are available in Jupyter notebooks on their GitHub repo. So, I'm definitely going to be taking a closer look at that.

18:08 KENNEDY: Yeah, that's really cool. They have all these nice examples there and even buttons to open them in Colab, which is pretty awesome. This looks great. And it looks like it's doing some work with FastAPI as well. I know they hired the person who's maintaining FastAPI, which is cool. Also, their Prodigy project, so, yeah. This looks like a really nice library that they put together. Cool, and Ines has been on the show before. Ines from Explosion AI appeared here as a guest co-host as well. Super cool.

18:38 BOYKIS: That's awesome.

18:39 KENNEDY: This next one I want to talk about, I know I'd love to get your opinion because you're more on the data science side of things, right? So this next one I want to tell folks about. This is another one from listeners. We talked about something that validates pandas, and a listener was like, oh well, you should also check out this thing. So this comes from Jacob Deppen. Thank you Jacob for sending this in. And so, it's pandas-vet. What it is, is a plugin for flake8 that checks pandas code, and it's this opinionated take on how you should use pandas. They say one of the challenges is that if you go and search on Stack Overflow or other tutorials or even maybe video courses, they might show you how to do something with pandas, but maybe that's a deprecated way of working with pandas, or some sort of old API and there's a better way. So the idea is to make pandas more friendly for newcomers by trying to focus on best practices and saying, "Don't do it that way. Do it this way." You know, read_csv, it has so many parameters, what are you doing, here's how you use it. Things like that. So, the idea for this linter was sparked by a talk by Ania Kapuścińska. Sorry, I'm sure I blew that name, but it was at PyCascades 2019, in Seattle, "Lint Your Code Responsibly", and I'll link to that as well. It's kind of cool to see the evolution. Ania gave a talk at PyCascades and then this person's like, "Oh, this is awesome. I'm going to actually turn this into a flake8 plugin." And so on. What are your thoughts on this? Do you like this idea?

20:10 BOYKIS: Yeah, I'm a huge fan of it. I think in general there's been kind of this, I don't want to say culture war, about whether notebooks are good or bad. And there was recently a paper released. I want to say, not a paper but a blog post a couple of days ago about how you should never use notebooks. There was a talk by Joel Grus last year about all the things that notebooks are bad at. I think they have their place and I think this is one of the ways you can have, I want to say, guard rails around them. And help people do things. I like the very opinionated warning they have here, which is that df is a bad variable name. Be kinder to yourself. Because that's always true, you always start with the default of df and then you end up with thirty-four or thirty-five of them. I joke about this on Twitter all the time. But it's true, so that's a good one. .iloc and .ix is always a point of confusion, so it's good that they have that. And then the pivot table one: pivot_table is preferred to pivot or unstack. So there's a lot of places, pandas is fantastic, but there's a lot of these places where you have old APIs, you have new APIs, you have people who usually are both new to Python and programming at the same time coming in and using these. So this is a good set of guard rails to help write better code if you're writing it in a notebook.
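
To give a feel for how a lint rule of this kind can work, here is a minimal sketch built on the standard library's `ast` module. It is not pandas-vet's implementation or its real messages: `lint_pandas_style` and the message text are hypothetical, and only two of the rules mentioned above are covered. The sample source is parsed, never executed, so pandas does not need to be installed.

```python
import ast


def lint_pandas_style(source):
    """Return sorted (line, message) pairs for two illustrative style rules."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            # Rule 1: flag assignments to a variable literally named "df".
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "df":
                    findings.append((node.lineno, "'df' is a vague variable name"))
        elif isinstance(node, ast.Attribute) and node.attr == "ix":
            # Rule 2: flag any use of the old .ix indexer.
            findings.append((node.lineno, "'.ix' is deprecated; prefer '.loc' or '.iloc'"))
    return sorted(findings)


SAMPLE = '''\
import pandas as pd
df = pd.read_csv("pets.csv")
first_row = df.ix[0]
adoptions = pd.read_csv("adoptions.csv")
'''

for line_number, message in lint_pandas_style(SAMPLE):
    print(f"line {line_number}: {message}")
```

A real flake8 plugin wraps checks like these in flake8's plugin interface and assigns each rule an error code, which this sketch skips.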

21:24 KENNEDY: Yeah, that's super cool. Do you know, is there a way to make flake8 run in a notebook automatically?

21:29 BOYKIS: I don't know.

21:30 KENNEDY: You probably can.

21:31 BOYKIS: It probably wouldn't be too hard. It's interesting that you ask that because that's generally not something you would do with notebooks, but maybe this kind of stuff will push it in the direction of being more what we consider quote-unquote mainstream webdev or backend programming.

21:49 KENNEDY: Yeah, cool, well I definitely think it's nice. If I were getting started with pandas. I would give this a check. You also, if you're getting started with pandas, you may also be getting started with NumPy, right?

21:59 BOYKIS: Yep, so NumPy is the backbone of numerical computing in Python. So I talked about TensorFlow, PyTorch, machine learning in the previous stories. All of that kind of rests on the work and data structures that NumPy created. So pandas, scikit-learn, TensorFlow, PyTorch, they all lean heavily on, if not directly depend on, the core concepts, which include matrix operations through the NumPy array, also known as an ndarray. The problem with ndarrays is that they're fantastic but their documentation was a little bit hard for newcomers. So Anne Bonner wrote a whole new set of documentation for people that are both new to Python and scientific programming, and that's included in the NumPy docs themselves. Before, if you wanted to find out what arrays were, how they work, you could go to the section and you could find out the parameters and attributes and all the methods of that class, but you wouldn't find out how or why you would use it. So, this documentation is fantastic because it has an explanation of what they are. It has visuals of what happens when you perform certain operations on arrays. And it has a lot of really great resources if you're just getting started with NumPy. I strongly recommend, if you're doing any type of data work in Python, especially with pandas, that you become familiar with NumPy arrays, and this makes it really easy to do so.

23:16 KENNEDY: Nice. It has things like, "How do I convert a 1D array to a 2D array?" or "What's the difference between a Python list and a NumPy array?" and whatnot. Yeah, it looks really helpful. I like the why. It's often missing. You'll see, "Use this function for this and here are the parameters." Sometimes they'll describe them. Sometimes not. You know, and then it's just like, "Well, maybe this is what I want? Stack Overflow seemed to indicate this is what I want. I'm not sure. I'll give it a try." Right, so I like the little extra guidance behind it. That's great.
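
The two questions quoted above have short answers in code. This is a small sketch that assumes NumPy is installed; the variable names are only for illustration.

```python
import numpy as np

values = [1, 2, 3, 4, 5, 6]
array = np.array(values)           # 1D array, shape (6,)

# Converting a 1D array to 2D: add an axis, or reshape.
as_row = array[np.newaxis, :]      # shape (1, 6)
as_column = array[:, np.newaxis]   # shape (6, 1)
as_grid = array.reshape(2, 3)      # shape (2, 3)

# List versus array: the same operator means different things.
repeated_list = values * 2         # list repetition, 12 elements
doubled_array = array * 2          # elementwise multiplication, still 6 elements

print(as_row.shape, as_column.shape, as_grid.shape)
print(len(repeated_list), doubled_array.tolist())
```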

23:49 BOYKIS: Yeah, it does a really good job of orienting you.

23:50 KENNEDY: Cool. Alright, well, Vicki, those are our main topics for the week, but we got a few extra quick items just to throw in here at the end. I'll let you go first with yours.

23:50 BOYKIS: Sure, this is just a bit of blatant self-promotion about who I am. I'm a data scientist. On the side I write a newsletter that's called Normcore Tech. And it's about all the things that I've not seen covered in the mainstream media. And it's just a random hodge-podge of stuff. It ranges from anything, like machine learning. How the datasets got created initially for NLP. I've written about Elon Musk memes. I wrote about the recent raid of the Nginx office in great detail and what happened there. So, there's a free version that goes out once a week and paid subscribers get access to one more paid newsletter per week. But really it's more about the idea of supporting in-depth writing. So it's just vicki.substack.com.

23:50 KENNEDY: Cool. Well, that's a neat newsletter and I'm a subscriber, so very, very nice. I've a quick one for you all out there. And maybe two actually. One, pip 20.0 was released. So, not a huge change, obviously pip is compatible with the stuff that it did before and whatnot, but it does a couple of nice things, and I think this is going to be extra nice for beginners because it's so challenging. You go to a tutorial and it says, alright, the first thing you've got to do to run whatever. I want to run Flask, or I want to run Jupyter, so you say, pip install flask, or pip install jupyter. It says, you do not have permission to write to wherever you were going to install. Right? Depending on your system. And, so, if that happens now in pip 20.0, it will install as if --user was passed, into the user profile. That's cool, huh?
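
To see where a user install ends up on a given machine, the standard library's `site` module reports the per-user site-packages directory. This is only a way to inspect the location; it does not call pip.

```python
import site
import sys

# The per-user site-packages directory for this interpreter.
user_site = site.getusersitepackages()
print("user site-packages:", user_site)

# User installs only apply when the user site is enabled; it is normally
# disabled inside a virtual environment.
print("user site enabled:", site.ENABLE_USER_SITE)
print("inside a virtual environment:", sys.prefix != sys.base_prefix)
```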

23:50 BOYKIS: That's really neat.

23:50 KENNEDY: Yeah, so that's great. And it caches wheels built from Git requirements, and a couple of other things. So, yeah, nothing major but nice to have that there. And then, also, I'd previously gone on a bit of a rant that I was bugged that Homebrew, which is how I put Python on my Mac, was great for installing Python 3 up until 3.7. So, it's even better because if you just say, brew install python, that means install Python 3, not Legacy Python. Which is great, but that sort of stopped working. It still works, but it installs Python 3.7. So, that was kind of like, "Oh, sad face." But, I'm sorry, I forget the person who sent this over on Twitter, but one of the listeners sent in a message that said, you can brew install python@3.8, and that works.

23:50 BOYKIS: Is it safe to brew again? I've just started downloading directly from Python.

23:50 KENNEDY: I know. Exactly. So, I'm trying it today and so far it's going well. So, I'm really excited that on macOS we can probably get the latest Python. Even if you have to say the version. I just have an alias that re-aliases what python means in my .zshrc file and it'll just say, "You know, if you type python that means python3.8", for now. Anyway, fingers crossed, looks like it's good and hopefully it just keeps updating itself. I suspect it will, at least within the 3.8 branch. Alright, you ready to close this out with a joke? I'm sure you've heard the type of joke. You know, "A mathematician and a physicist walk into a bar and..." Right? Well, some weird thing about numbers and space. So this one is kind of like that one. It's about search engine optimization. So, an SEO expert walks into a bar, bars, pub, public house, Irish pub, tavern, bartender, beer, liquor, wine, alcohol, spirits... and so on. It's bad, huh?

23:50 BOYKIS: I like that. That's nice.

23:50 KENNEDY: Yeah, it's so true. You remember how blatant websites used to be like ten years ago? They would just have like a massive bunch of just random keywords at the bottom. Just you know like it seems...

23:50 BOYKIS: Yeah, and sometimes they would be in white-on-white text.

23:50 KENNEDY: Yes, exactly, white on white.

23:50 BOYKIS: So you couldn't see them. Then if you highlight it you see a whole three paragraphs.

23:50 KENNEDY: Here's where the SEO hacker went. I don't think that works so well anymore. Yeah, it's a good joke nonetheless. And, Vicki, it's been great to have you here. Thanks so much for filling in for Brian, and sharing the data science view of the world with us.

23:50 BOYKIS: Thanks for having me.

23:50 KENNEDY: You bet. Bye.

23:50 BOYKIS: Bye.

23:50 KENNEDY: Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes. That's pythonbytes as in B-Y-T-E-S, and get the full show notes at pythonbytes.fm. If you have a news item you want featured just visit pythonbytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.
