Brought to you by Michael and Brian - take a Talk Python course or get Brian's pytest book

#167: Cheating at Kaggle and uWSGI in prod

Published Mon, Feb 3, 2020, recorded Wed, Jan 29, 2020

Sponsored by Datadog: pythonbytes.fm/datadog

Special guest: Vicki Boykis: @vboykis

Michael #1: clize: Turn functions into command-line interfaces

  • via Marcelo
  • Follow up from Typer on episode 164.
  • Features
    • Create command-line interfaces by creating functions and passing them to [clize.run](https://clize.readthedocs.io/en/stable/api.html#clize.run).
    • Enjoy a CLI automatically created from your functions’ parameters.
    • Bring your users familiar --help messages generated from your docstrings.
    • Reuse functionality across multiple commands using decorators.
    • Extend Clize with new parameter behavior.
  • I love how this is pure Python without its own API for the default case

Vicki #2: How to cheat at Kaggle AI contests

  • Kaggle is a platform, now owned by Google, that allows data scientists to find data sets, learn data science, and participate in competitions
  • Many people participate in Kaggle competitions to sharpen their data science/modeling skills
  • Recently, a competition that was related to analyzing pet shelter data resulted in a huge controversy
  • Petfinder.my is a platform that helps people find pets to rescue in Malaysia from shelters. In 2019, they announced a collaboration with Kaggle to create a machine learning predictor algorithm of which pets (worldwide) were more likely to be adopted based on the metadata of the descriptions on the site.
  • The total prize offered was $25,000
  • After several months, a contestant won. He was already a Kaggle Grandmaster, and took home the $10k first-place prize.
  • A volunteer, Benjamin Minixhofer, offered to put the algorithm in production, and when he did, he found that there was a huge discrepancy between first and second place
  • Technical Aspects of the controversy:
    • The data they gave asked the contestants to predict the speed at which a pet would be adopted, from 1-5, and included input features like type of animal, breed, coloration, whether the animal was vaccinated, and adoption fee
    • The initial training set had 15k animals and the teams, after a couple months, were then given 4k animals that their algorithms had not seen before as a test of how accurate they were (common machine learning best practice).
    • In a Kaggle Kernel (a hosted Jupyter notebook), Minixhofer explains how the winning team cheated
    • First, they individually scraped Petfinder.my to find the answers for the 4k test data
    • Using md5, they created a hash for each unique pet, and looked up the score for each hash from the external dataset - there were 3500 overlaps
    • Used pandas column manipulation to get at the hidden prediction variable for every 10th pet, and replaced the prediction that should have been generated by the algorithm with the actual value
    • Using mostly: obfuscated functions, Pandas, and dictionaries, as well as MD5 hashes
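Mechanically, the trick amounts to something like this sketch (the data, column names, and every-2nd-row interval are invented for illustration; the real notebook used every 10th pet and was heavily obfuscated):

```python
import hashlib

import pandas as pd

def pet_key(row):
    """Hash a pet's metadata fields into a stable lookup key,
    as the winning notebook reportedly did with MD5."""
    raw = "|".join(str(row[col]) for col in ("animal", "breed", "color", "fee"))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

# Stand-in for the hidden 4k test set (tiny and invented here):
test_set = pd.DataFrame({
    "animal": ["dog", "cat", "dog", "cat"],
    "breed": ["mixed", "siamese", "lab", "persian"],
    "color": ["brown", "white", "black", "grey"],
    "fee": [0, 50, 0, 25],
    "prediction": [3, 3, 3, 3],   # what the model actually predicted
})

# Stand-in for the true adoption speeds scraped from petfinder.my, keyed by hash:
true_speeds = [1, 4, 2, 5]
scraped = {pet_key(row): speed
           for (_, row), speed in zip(test_set.iterrows(), true_speeds)}

# Overwrite the model's output with the scraped truth for only a subset
# of rows, so the score improves without looking suspiciously perfect.
for i in range(len(test_set)):
    key = pet_key(test_set.iloc[i])
    if key in scraped and i % 2 == 0:
        test_set.loc[test_set.index[i], "prediction"] = scraped[key]
```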
  • Fallout:
    • The winner was fired from his job at H2O.ai
    • Kaggle issued an apology explaining exactly what happened, and expressed the hope that this wouldn't cast suspicion on future contests, calling for more openness and collaboration going forward

Michael #3: Configuring uWSGI for Production Deployment

  • We run a lot of uWSGI-backed services. I’ve spoken about this in depth back on Talk Python 215: The software powering Talk Python courses and podcast.
  • This is guidance from Bloomberg Engineering’s Structured Products Applications group
  • We chose uWSGI as our host because of its performance and feature set. But, while powerful, uWSGI’s defaults are driven by backward compatibility and are not ideal for new deployments.
  • There is also an official Things to Know doc.
  • Unbit, the developer of uWSGI, has “decided to fix all of the bad defaults (especially for the Python plugin) in the 2.1 branch.” The 2.1 branch is not released yet.
  • Warning: I had trouble with die-on-term and systemctl (restarts hung for about two minutes until the process was forcefully killed)
  • Settings I’m using:
    # This option tells uWSGI to fail to start if any parameter
    # in the configuration file isn’t explicitly understood by uWSGI.
    strict = true
    
    # The master uWSGI process is necessary to gracefully re-spawn
    # and pre-fork workers, consolidate logs, and manage many other features
    master = true
    
    # uWSGI disables Python threads by default, as described in the Things to Know doc.
    enable-threads = true
    
    # This option will instruct uWSGI to clean up any temporary files or UNIX sockets it created
    vacuum = true
    
    # By default, uWSGI starts in multiple interpreter mode
    single-interpreter = true
    
    # Prevents uWSGI from starting if it is unable to find or load your application module
    need-app = true
    
    # uWSGI provides some functionality which can help identify the workers
    auto-procname = true
    procname-prefix = pythonbytes-
    
    # Forcefully kill workers after 60 seconds. Without this feature,
    # a stuck process could stay stuck forever.
    harakiri = 60
    harakiri-verbose = true
    
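Collected into a single ini file, those settings look roughly like this (the module, socket, and worker count at the bottom are placeholders for your own app, not values from the article):

```ini
[uwsgi]
strict = true
master = true
enable-threads = true
vacuum = true
single-interpreter = true
need-app = true
auto-procname = true
procname-prefix = pythonbytes-
harakiri = 60
harakiri-verbose = true

; app-specific placeholders, substitute your own
module = yourapp.wsgi:application
http-socket = 127.0.0.1:8000
processes = 4
```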

Vicki #4: Thinc: A functional take on deep learning, compatible with Tensorflow, PyTorch, and MXNet

  • A deep learning library from Explosion that abstracts away some TensorFlow and PyTorch boilerplate
  • Already runs under the covers in spaCy, Explosion's NLP library
  • Type checking, particularly helpful for tensors; PyTorchWrapper and TensorFlowWrapper classes let you intermingle both frameworks
  • Deep support for numpy structures and semantics
  • Assumes you’re going to be using stochastic gradient descent
  • And operates in batches
  • Also cleans up the configuration and hyperparameters
  • Mainly hopes to make it easier and more flexible to do matrix manipulations, using a codebase that already existed but was not customer-facing.
  • Examples and code are all available in notebooks in the GitHub repo

Michael #5: pandas-vet

  • via Jacob Deppen
  • A plugin for Flake8 that checks pandas code
  • Starting with pandas can be daunting.
  • The usual internet help sites are littered with different ways to do the same thing and some features that the pandas docs themselves discourage live on in the API.
  • Makes pandas a little more friendly for newcomers by taking some opinionated stances about pandas best practices.
  • The idea to create a linter was sparked by Ania Kapuścińska's talk at PyCascades 2019, "Lint your code responsibly!"

Vicki #6: NumPy beginner documentation

  • NumPy is the backbone of numerical computing in Python: pandas (which I mentioned before), scikit-learn, TensorFlow, and PyTorch all lean heavily on, if not directly depend on, its core concepts, which include matrix operations through a data structure known as the NumPy array (ndarray), which is different from a Python list
  • Anne Bonner wrote up new documentation for NumPy that introduces these fundamental concepts to beginners coming to both Python and scientific computing
  • Before, you went directly to the section about arrays and had to search through it to find what you wanted. The new guide, which is very nice, includes a step-by-step on how arrays work, how to reshape them, and illustrated guides on basic array operations.
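The kind of fundamentals the new docs walk through, in a quick sketch of how an ndarray differs from a list:

```python
import numpy as np

# A Python list "multiplies" by repetition; an ndarray multiplies elementwise.
nums = [1, 2, 3]
repeated = nums * 2                # [1, 2, 3, 1, 2, 3]

arr = np.array([1, 2, 3])
doubled = arr * 2                  # array([2, 4, 6])

# Reshaping: the same six values viewed as a 2x3 matrix.
grid = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
col_sums = grid.sum(axis=0)        # array([3, 5, 7])
```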

Extras:

Vicki

  • I write a newsletter, Normcore Tech, about all things tech that I’m not seeing covered in the mainstream tech media. I’ve written before about machine learning, data for NLP, Elon Musk memes, and Nginx.
  • There’s a free version that goes out once a week and paid subscribers get access to one more newsletter per week, but really it’s more about the idea of supporting in-depth writing about tech. vicki.substack.com

Michael:

  • pip 20.0 Released - Default to doing a user install (as if --user was passed) when the main site-packages directory is not writeable and user site-packages are enabled, cache wheels built from Git requirements, and more.
  • Homebrew: brew install python@3.8

Joke:

An SEO expert walks into a bar, bars, pub, public house, Irish pub, tavern, bartender, beer, liquor, wine, alcohol, spirits...

Episode Transcript

Collapse transcript

00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to

00:04 your earbuds. This is episode 167, recorded January 29th, 2020. I'm Michael Kennedy,

00:11 and Brian Okken is away. We miss you, Brian, but I have a very special guest to join me,

00:16 Vicki Boykis. Welcome, Vicki.

00:18 Thanks for having me.

00:19 Yeah, it's great to have you here. I'm excited to get your take on this week's Python news. I

00:23 know you found some really interesting and controversial ones that we're going to jump

00:27 into, and that would be great. Also, great is Datadog. They're sponsoring this episode,

00:31 so check them out at pythonbytes.fm/Datadog. I'll tell you more about them later. Vicki,

00:37 let me kick us off with command line interface libraries, for lack of a better word. So back

00:44 on episode 164, so three episodes ago, I talked about this thing called Typer. Have you heard of

00:50 Typer? T-Y-P-E-R?

00:51 I have not, but I've heard of Click, so I'm curious to see how this differs from that even.

00:56 Yeah, yeah. So this is sort of a competitor to Click. Typer is super cool because what it does is it uses

01:03 native Python concepts to build out your CLI rather than attributes where you describe everything. So,

01:11 for example, you can have a function and you just say this function takes a name, colon,

01:16 str to give it a type or an int or whatever, and then Typer can automatically use the

01:23 type and the name of the parameters and stuff to generate like your help and the inbound arguments

01:28 and so on. So that's pretty cool, right? Yeah. Seems like a great excuse to start using type

01:32 annotations if you haven't yet. Yeah, exactly. Very, very nice that it leverages type annotations,

01:37 hence the name Typer, right? So our listeners are great. They always send in stuff that we haven't

01:41 heard about or you're like, I can't believe you didn't talk about this other thing. So Marcello sent in a

01:47 message and says, hey, you should talk about Clize, C-L-I-Z-E, which turns functions into command line

01:54 interface. So Clize is really cool. And it's very similar in regard to how it works as to Typer. So what you do is you

02:04 create functions, you give them variables, you don't have to use the types in the sense that Typer does. But you have

02:10 positional arguments, and you have keyword only arguments. And you know, Python has that syntax that very few people use, but it's cool.

02:18 If you want to enforce it, where you can say here are some parameters to a function comma star comma, here are some more and the

02:26 stuff after the star has to be addressed as a keyword argument, right? Yeah, so it leverages that kind of stuff. So you can say, like their

02:34 example says, here's a Hello World function, and it takes a name, which has a default of none, and then star comma, no

02:40 capitalize is false, and it gives it a default value. So all you got to do to run it is basically import clize.run and call

02:48 run on your function. And then what it does is it verifies those arguments about whether or not they're required. And then it'll

02:56 convert the keyword arguments to like dash dash, this or that. So like dash dash, no capitalize, will pass true to no capitalize. If you omit it, it'll

03:06 pass, you know, whatever the default is, I guess, so false. So there's like positional ones where you don't say the name, but then also this cool way of

03:13 adding on these, these --capitalize and so on. So it seems like a really cool, and pretty extensive library for building command line

03:21 interfaces. Yeah, so this seems like it'd be good if you have a lot of parameters that you have to pass in. I'm thinking,

03:26 specifically of some of the work that you would do in the cloud, like in the AWS command line. Yeah, yeah. Or similar?

03:31 Yeah, for sure. Another thing that's cool is it will take your doc strings and use those as help messages.

03:37 Oh, that's neat. Yeah, so you know, and like some editors, you can type triple quote, enter, and it'll generate, you know, here's the

03:44 summary of the method, and then here's the arguments, and you can put this, or you can just write them out, of course. And then here's the

03:50 descriptions about each parameter, those become help messages about each command in there. So it's really

03:58 nice. And I like how it uses just pure Python, sort of similar to Typer in that regard, that you don't put

04:04 like three or four levels of decorators on top of things and then reference other parts of that. You just say,

04:09 here's some Python code, I want to treat it as a command line interface, clize.run.

04:14 That is pretty cool. Yeah.

04:16 So there's now a lot of choices if you want to do command line interfaces.

04:19 Yeah, yeah, definitely. And click is good, and it's very popular in argparse as well. But I'm kind of a

04:26 fan of these pure Python ones that don't require me to go do a whole bunch of extra stuff. So yeah,

04:30 definitely loving that. You know what, I bet that Kaggle's not loving what you're talking about next.

04:35 Before we get into this.

04:37 Well, I think they might be, but...

04:39 Yeah, we'll see. Okay. Tell us about Kaggle and what the big news here is.

04:43 Yeah. So there was a dust up at Kaggle a couple weeks ago. So just as a little bit of background,

04:48 Kaggle is a platform that's now owned by Google that allows data scientists to find data sets,

04:54 to learn data science. And most importantly, it's probably known for letting people participate

04:59 in machine learning competitions. That's kind of how it gained its popularity and notoriety.

05:03 Yeah, that's how I know it.

05:04 Yep. And so people can sharpen their data science and modeling skills on it. So they recently,

05:09 I want to say last fall, hosted a competition that was about analyzing pet shelter data. And this

05:16 resulted in enormous controversy. So what happened is there's this website that's called petfinder.my

05:23 that helps people find pets to rescue in Malaysia from shelters. And in 2019, they announced a

05:30 collaboration with Kaggle to create a machine learning predictor algorithm, which pets would be most

05:36 likely to be adopted based on the metadata descriptions on the site. So if you go to petfinder.my,

05:42 you'd see that they'll have a picture of the pet and then a description, how old they are,

05:47 and some other attributes about them.

05:49 Right. Were they vaccinated or things like that, right? Sort of, you might think, well,

05:54 if they're vaccinated or they're neutered or spayed, they may be more likely to be adopted,

05:59 but you don't necessarily know, right? So that was kind of some, like, what are the important factors

06:03 was this whole competition, right?

06:04 Yeah. The goal was to help the shelters write better descriptions so that pets would be adopted

06:10 more quickly. So after several months, so they held the competition for several months and there was a

06:16 contestant that won and he was previously what was called a Kaggle Grandmaster. So he'd won a lot of

06:21 different stuff on Kaggle before and he won $10,000 in prize money. But then what happened is they

06:27 started to validate all of his data. Because when you do a Kaggle competition, you then submit

06:33 all of your data and all of your results and your notebooks and your code.

06:37 Like how you trained your models and stuff like that, right?

06:40 Yeah.

06:40 Okay.

06:41 Yeah. All of that stuff. And then what happened was a pet finder wanted to put this model into

06:46 production. So you initially have something like a Jupyter or a Colab notebook in this case.

06:51 And the idea is that now you want to be able to integrate it into the pet finder website.

06:56 So they can actually use these predictors to fine tune how they post the content. And so when a

07:09 volunteer who was Benjamin Minixhofer offered to put the algorithm into production and he started

07:09 looking at it, he found that there was a huge discrepancy between the first and second place

07:14 entrants in the contest. And so what happened was, so a little to get more into the technical

07:20 aspect, the data they gave to the contestants asked them to predict the speed at which a pet

07:24 would be adopted from one to five and included some of the features you talked about, like animal

07:28 breed coloration, all that stuff. The initial training set had 15,000 animals. And then after a

07:35 couple of months, the contestants were given 4,000 animals that had not been seen before as a test of

07:41 how accurate they were. So what the winner did was he actually scraped basically most of the website so

07:48 that he got that 4,000 set, the validation set also. And he had the validation set in his notebook.

07:56 So basically what he did was he used the MD5 library to create a hash for each unique pet. And then he looked

08:05 up the adoption score for each of those pets, basically when they were adopted from that external

08:10 data set. And there were about 3,500 that had overlaps with the validation set. And then he did a column

08:18 manipulation in pandas to get at the hidden prediction variable for every 10th pet, not every single pet,

08:23 but every 10th pet. So it didn't look too obvious.

08:25 Right. So he gave himself like a 10% head start or advantage or something like that.

08:30 Exactly. And he replaced the prediction with, that should have been generated by the algorithm with

08:36 the actual value. And then he did a dictionary lookup between the initial MD5 hash and the value

08:42 of the hash. And this was all obfuscated in a separate function that happened in his data.

08:48 Wow. And so they must've been looking at this going, what does the MD5 hash of the pet attributes

08:53 have to do with anything? You know what I mean? Right. It's the, the hashes are meant to obscure

08:58 stuff, right?

08:59 Right. Yeah.

09:00 So what was the fallout?

09:01 So the fallout was this guy worked at h2o.ai. And so he was fired from there and Kaggle also issued

09:08 an apology where they explained exactly what happened. And they expressed the hope that this didn't mean

09:14 that every contest going forward would be viewed with suspicion for more openness and for collaboration

09:20 going forward. Wow. And it was an amazing catch. Yeah. That's such a good catch. I'm so,

09:24 so glad that Benjamin did that. I've got the whole deal here. Now did Kaggle actually end up paying

09:31 him the 10,000 before they caught it? Is there like some sort of waiting period? Unfortunately,

09:36 I think the money had already been disbursed by that point. Yeah. I can easily see something. Well,

09:42 you know, like the prize money will be sent out after a, you know, very deep, it may change the

09:49 timing of that for sure in the future, who knows, but wow, that's crazy. Do you know why he was fired?

09:54 I mean, they're just like, we don't want you to say, I mean, h2o.ai, they're kind of a,

10:01 we'll help you with your AI story. So I guess, you know, they're probably just like, we don't want

10:07 any of the negativity of that on our product. Yeah, I think that's essentially it. And it was

10:13 a pretty big competition in the data science community. And I think also once they'd started

10:17 to look into it in other places, previously, he talked about just basically scraping data to gain

10:24 competitions as well. So all of that stuff started to come out as well. I think they wanted to distance

10:29 themselves. Yeah, I can imagine. Yeah. Okay. Well, thank you for sharing that. Now, before we get to the

10:33 next one, let me tell you about this week's sponsor, Datadog. They're a cloud scale monitoring

10:38 platform that unifies metrics, logs and traces. Monitor your Python applications in real time,

10:44 find bottlenecks with detailed flame graphs, trace requests as they travel across service boundaries,

10:49 and they're tracing client auto instruments, popular frameworks like Django, AsyncIO, Flask,

10:56 so you can quickly get started monitoring the health and performance of your Python apps.

11:00 Do that with a 14 day free trial and Datadog will send you a complimentary t-shirt, cool little

11:06 Datadog t-shirt. So check them out at pythonbytes.fm/Datadog. This next one kind of hits home for me

11:12 because I have a ton of services and a lot of servers and websites and all these things working

11:17 together, running on micro WSGI, UWSGI. And I've had it running for quite a few years. It's got a lot of

11:25 traffic, you know, we do like, I don't know, 14 terabytes of traffic a month or maybe even more

11:31 than that. So quite a bit of traffic going around these services and whatnot. So it's been working

11:37 fine. But I ran across this article by the engineers at Bloomberg. So they talked about this thing called

11:43 configuring micro WSGI for production deployment. And I actually learned a lot from this article. So

11:50 I don't feel like I was doing too many things wrong, but there was a couple of things I'm like,

11:53 oh yeah, I should probably do that. And other stuff just that is really nice. So I just want to run you

11:58 through a couple of things that I learned. And if you want to hear more about how we're using micro WSGI,

12:03 you can check that out on Talk Python 215. Dan Bader and I swap stories about how we're running our

12:10 thing, our various things, you know, Talk Python training and realpython.com and whatnot.

12:16 So this is guidance from Bloomberg's engineering structured products application group.

12:22 That's quite the title. And they decided to use micro WSGI because it's really, you know,

12:27 good for performance, easy to work with. However, they said micro WSGI is, as it's maturing,

12:34 some of the defaults that made sense when it was new, like in 2008, don't make sense anymore.

12:40 The reason is partly just because the way people use these sites is different or these servers is

12:46 different. For example, doing proxies up in front of micro WSGI with say Nginx, that used to not be

12:53 so popular. So they made these defaults built into the system that maybe don't make sense anymore.

12:58 And so what they did is we're going to go through and they said, we're going to go through and talk

13:02 about all the things that we're going to override the defaults for. And why unbit the developer

13:07 micro WSGI is going to fix all of these bad defaults in the 2.1 release. But right now it's 2.0

13:14 as of this recording. So you're going to have to just, you know, hang in there or apply some of these

13:20 changes. Now, I do want to point out one thing. When I switched on a lot of these, I did them one at a

13:25 time. And the way you get it to reload its config is you say, relaunch the process, restart the process

13:31 with systemctl, like a daemon management thing from Linux. And one of their recommendations

13:37 is to use this flag die on term, which is for it to die on a different signal that it receives.

13:44 And for whatever reason, maybe I'm doing it wrong. But whenever I turn that on, it would just lock up

13:49 and it would take about two minutes to restart the server because it would just hang until it

13:54 eventually timed out. It was like forcefully killed. So that seems bad. So I'm not using that.

13:58 But I'll go quickly over the settings that I use that I thought were cool here. So there's,

14:03 you've got these complicated config files. If you want to have make sure everything's validated,

14:07 you can say strict equals true. That's cool. That will verify that everything that's typed in the

14:11 file is accurate and is valid because it's kind of forgiven at the moment. Master is true is a good

14:17 setting because this allows it to create worker processes and recycle them based on number of requests

14:22 and so on. Something that's interesting, I didn't even realize you could do. Maybe tell me if you knew

14:28 this was possible in Python apps. You can disable the GIL, the global interpreter lock. You can say,

14:33 you know what, for this Python interpreter, let's not have a GIL.

14:36 Wow. How does that work?

14:37 Yeah. Well, it's, I mean, people talk about having no GIL. It's like, oh, you can do all this cool

14:41 concurrency and whatnot. But what it really means is you're basically guaranteeing you can only have

14:46 one thread. So if you try to launch, say, a background job on a micro WSGI server and you don't

14:52 pass enable-threads = true, it's just going to not run because there's no GIL and there's no way to start it.

14:57 That's something you want to have on. Vacuum equals true. This one I had off and I turned it on.

15:03 Apparently this cleans up like temporary files and so on. Also a single interpreter. It used to be that

15:07 micro WSGI was more of an app server that might have different versions of Python and maybe Ruby as well.

15:12 And this will just say, no, no, it's just the one version. A couple other ones. You can specify the

15:17 name that shows up in like top or glances. So it'll say like, you can give it, say your website name,

15:24 and it'll say things like that thing worker process one or that thing master process or whatnot. And so

15:30 there's just a bunch of cool things in here with nice descriptions of why you want these features.

15:35 So if you are out there and you're running micro WSGI, give this a quick scan. It's really cool.

15:40 Now, this next one also is pretty neat. So this one comes from the people who did spaCy, right? What do

15:46 they got going on?

15:47 Yep, that's right. So this was just released a couple of days ago and it's called Think and they bill it as

15:54 a functional take on deep learning. And so basically what it there's, if you're familiar with deep

15:59 learning, there's kind of two big competing frameworks right now. TensorFlow and PyTorch and

16:05 MXNet is also in there. So the idea of this library is that it abstracts away some of the boilerplate

16:11 that you have to write for both TensorFlow and PyTorch. PyTorch has a little bit less. TensorFlow

16:15 with Keras on top also has a little bit less, but you end up writing a lot of the same kind of stuff.

16:22 And there's also some stuff that's obfuscated away from you, specifically some of the matrix

16:28 operations that go on under the hood. And so what Think does is, so it already runs on spaCy, which is

16:36 an NLP library under the covers. So what the team did was they surfaced it so that other people could

16:42 use it more generically in their projects. And so it has that favorite thing that we love. It has type

16:49 checking, which is particularly helpful for tensors when you're trying to get stuff and you're not sure

16:55 why it's not returning things. It has classes for PyTorch wrappers and for TensorFlow. And you can

17:01 intermingle the two if you want to, if you have two libraries that bridge things. It has deep support

17:07 for NumPy structures, which are the kind of the underlying structures for deep learning.

17:12 It operates in batches, which is also a common feature of deep learning projects. So they process

17:19 features and data in batches. And then it also sometimes a problem that you have with deep

17:25 learning is you're constantly tuning hyperparameters or the variables that you put into your model to

17:31 figure out how long you're going to run it for, how many training epochs you're going to have, what size

17:36 your images are going to be. Usually those are those clustered in the beginning of your file is kind of

17:41 like a dump or a dictionary or whatever. It has a special structure to handle those as well. So it basically

17:47 hopes to make it easier and more flexible to do deep learning, especially if you're working with

17:53 two different libraries and it offers a nice higher level abstraction on top of that. And the other

17:58 cool thing is, is they have already released all the examples and code that are available in Jupyter

18:04 notebooks on their GitHub repo. So I'm definitely going to be taking a closer look at that.

18:08 Yeah, that's really cool. They have all these nice examples there and even buttons to open them in

18:12 Colab, which is, yeah, that's pretty awesome. This looks great. And it looks like it's doing some work

18:19 with FastAPI as well. I know they hired a person who's maintaining FastAPI, which is cool. Also,

18:25 their Prodigy project. So yeah, this looks like a really nice library that they put together. Cool. And

18:31 Ines has been on the show before. Ines from Explosion AI here as a guest co-host as well. Super

18:37 cool. That's awesome. Yeah. This next one I want to talk about, you know, I'd love to get your opinion

18:41 because you're more on the data science side of things, right? Yeah. Yeah. So this next one, I want

18:45 to tell folks about, this is another one from listeners. You know, we talked about something that

18:51 validates pandas, and they were like, oh, you should also check out this thing. So this comes from

18:55 Jacob Deppen. Thank you, Jacob, for sending this in. And so it's pandas dash vet. And what it is,

19:02 is a plugin for Flake8 that checks pandas code. And it's this opinionated take on how you should use

19:09 pandas. They say one of the challenges is that, if you go and search on Stack Overflow or other

19:16 tutorials, or even maybe video courses, they might show you how to do something with pandas,

19:20 but maybe that's a deprecated way of working with pandas or some sort of old API. And there's,

19:26 there's a better, better way. So the idea is to make pandas more friendly for newcomers by trying

19:32 to focus on best practices and saying, don't do it that way. Do it this way. You know, read_csv. It

19:36 has so many parameters. What are you doing? Here's how you use it. Things like that. So this is based on a

19:43 talk or this linter was created. The idea was sparked by a talk by Ania Kapuścińska.

19:50 Sorry, I'm sure I blew that name bad, but at PyCascades 2019 in Seattle,

19:55 Lint your code responsibly. So I'll link to that as well. So it's kind of cool to see the evolution.

19:59 Like Ania gave a talk at PyCascades and then this person's like, oh, this is awesome. I'm going to

20:05 actually turn this into a Flake 8 plugin and so on. What are your thoughts on this? Do you like this idea?

20:10 Yeah, I'm a huge fan of it. I think in general, there's been kind of like this,

20:13 I wouldn't want to say culture war about whether notebooks are good or bad. And there was recently

20:18 a paper released, I want to say, not a paper, but a blog post a couple of days ago about how you should

20:24 never use notebooks. There was a talk by Joel Grus last year about all the things that notebooks are

20:31 bad with. I think they have their place. And I think this is one of the ways you can have,

20:36 I want to say, guardrails around them and help people do things. I like the very opinionated

20:42 warning that they have here, which is that DF is a bad variable name. Be kinder to yourself,

20:47 because that's always true. You always start with the default of DF and then you end up with

20:51 34, 35 of them. I joke about this on Twitter all the time, but it's true. So that's a good one.

20:56 The .ix versus .loc and .iloc is always a point of confusion. So it's good that they have that.

21:03 And then the pivot_table one is preferred to pivot or unstack. So there's a lot of places. So pandas is

21:08 fantastic, but there's a lot of these places where you have old APIs, you have new APIs, you have people

21:14 who usually are both new to Python and programming at the same time coming in and using these. So this

21:20 is a good set of guardrails to help write better code if you're writing it in a notebook.
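The guardrails discussed here, preferring `.loc`/`.iloc` over the deprecated `.ix`, `pivot_table` over `pivot`/`unstack`, and a descriptive name over `df`, can be sketched like this. The data is made up purely for illustration; this isn't code from the plugin itself.

```python
import pandas as pd

# Made-up pet-adoption data, purely illustrative.
pets = pd.DataFrame({
    "breed": ["lab", "pug", "lab"],
    "fee": [100, 50, 80],
})

# Preferred: explicit .iloc (position-based) and .loc (label/boolean-based)
# instead of the deprecated, ambiguous .ix.
first_fee = pets.iloc[0]["fee"]
labs = pets.loc[pets["breed"] == "lab"]

# Preferred: .pivot_table rather than .pivot or .unstack.
mean_fee_by_breed = pets.pivot_table(values="fee", index="breed", aggfunc="mean")
print(first_fee)                            # 100
print(len(labs))                            # 2
print(mean_fee_by_breed.loc["lab", "fee"])  # 90.0
```

A descriptive name like `pets` also sidesteps the "df is a bad variable name" warning mentioned above.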

21:24 Oh yeah, that's super cool. Do you know, is there a way to make Flake8 run in the notebook

21:28 automatically? I don't know. You probably can, yeah. It probably wouldn't be too hard.

21:32 Yeah, but I don't know. Yeah, but I think it's interesting that you ask that because that's not

21:36 generally that's something you would do with notebooks. But maybe this kind of stuff will push it in the

21:41 direction of being more like what we consider quote unquote mainstream or just web dev or

21:48 backend programming. Yeah, cool. Well, I definitely think it's nice if I were getting started with

21:52 Pandas, give this a check. You also, if you're getting started with Pandas, you may also be getting

21:57 started with NumPy, right? Yep. So NumPy is the backbone of numerical computing in Python.

22:03 So I talked about TensorFlow, PyTorch, machine learning in the previous stories. All of that

22:08 kind of rests on the work and the data structures that NumPy created. So Pandas, Scikit-Learn,

22:15 TensorFlow, PyTorch, they all lean heavily, if not directly depend on the core concepts, which include

22:19 matrix operations through the NumPy array, also known as an ndarray. The problem with ndarrays is

22:27 they're fantastic, but the documentation was a little bit hard for newcomers. So Anne Bonner wrote

22:32 a whole new set of documentation for people that are both new to Python and scientific programming,

22:38 and that's included in the NumPy docs themselves. Before, if you wanted to find out what arrays were,

22:45 how they worked, you could go to the section and you could find out the parameters and attributes and

22:50 all the methods of that class. But you wouldn't find out how or why you would use it. And so this

22:55 documentation is fantastic because it has an explanation of what they are. It has visuals of

23:00 what happens when you perform certain operations on arrays. And it has a lot of really great resources if

23:05 you're just getting started with NumPy. My strong recommendation, if you're doing any type of data

23:10 work in Python, especially with Pandas, is that you become familiar with NumPy arrays. And this makes it

23:15 really easy to do so. Yeah, nice. It has things like, how do I convert a 1D array to a 2D array?

23:21 Or what's the difference between a Python list and a NumPy array and whatnot? Yeah, it looks really

23:28 helpful. I like the why. That's often missing. You'll see like, you do this, use this function for

23:34 this and here are the parameters. Sometimes they'll describe them, sometimes not. And then it's just like,

23:40 well, maybe this is what I want. Stack Overflow seemed to indicate this is what I want. I'm not

23:44 sure. I'll give it a try. Right. So I like the little extra guidance behind it. That's great.
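The two doc examples called out here, a Python list versus a NumPy array and converting a 1-D array to a 2-D array, boil down to something like this. It's a quick sketch in the spirit of the new docs, not copied from them:

```python
import numpy as np

# A Python list repeats on multiplication; an ndarray multiplies element-wise.
numbers = [1, 2, 3]
arr = np.array(numbers)
doubled_list = numbers * 2   # [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2        # array([2, 4, 6])

# Converting a 1-D array to a 2-D array with reshape or np.newaxis.
flat = np.arange(6)            # shape (6,)
grid = flat.reshape(2, 3)      # shape (2, 3)
column = flat[:, np.newaxis]   # shape (6, 1)
print(doubled_list, doubled_arr.tolist(), grid.shape, column.shape)
```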

23:47 Yeah, it does a really good job of orienting you. Cool. All right. Well, Vicki, those are our main

23:52 topics for the week. But we got a few extra quick items just to throw in here at the end. I'll let you

23:58 go first with yours. Sure. This is just a bit of blatant self-promotion about who I am. So I am a

24:04 data scientist. On the side, I write a newsletter that's called Normcore Tech. And it's about all the things that I'm

24:10 not seeing covered in the mainstream media. And it's just a random hodgepodge of stuff. It ranges from

24:15 anything like machine learning, how the data sets got created initially for NLP. I've written about Elon Musk

24:23 memes. I wrote about the recent raid of the Nginx office in great detail and what happened there. So there's a free

24:30 version that goes out once a week and paid subscribers get access to one more paid newsletter per week. But really, it's more

24:36 about the idea of supporting in-depth writing. So it's just vicki.substack.com.

24:41 Cool. Well, that's a neat newsletter. And I'm a subscriber. So very, very nice. I have a quick one for you all

24:48 out there. And maybe two, actually. One, pip 20.0 was released. So not a huge change. Obviously,

24:57 this is compatible with the stuff that I did before and whatnot. But it does a couple of nice things. And I

25:04 think this is going to be extra nice for beginners because it's so challenging. You go to a tutorial

25:10 and it says, all right, the first thing you got to do to run whatever, I want to run Flask or I want to

25:15 run Jupyter is you say pip install Flask or pip install Jupyter. And it says you do not have permission to,

25:21 you know, write to wherever those are going to install, right? Depending on your system. And so

25:27 if that happens now in pip 20, it will install as if --user was passed, putting it into the user profile.

25:35 That's cool, huh?

25:36 That's really neat.

25:37 Yeah, yeah. So that's great. And wheels built from Git requirements are now cached, and a couple of

25:42 other things. So yeah, nothing major, but nice to have that there. And then also, I'd previously

25:48 gone on a bit of a rant saying I was bugged that homebrew, which is how I put Python on my Mac,

25:54 was great for installing Python 3 until 3.7. It's even better because if you just say

26:01 brew install python, that means Python 3, not legacy Python, which is great.

26:07 But that sort of stopped working. It still works, but it installs Python 3.7. So that was,

26:12 that was kind of like a, oh, sad face. But I'm sorry, I forget the person who sent this over on

26:18 Twitter. But a listener sent in a message saying you can brew install python@3.8. And that

26:24 works. Why? That's not...

26:25 Is it safe to brew again? I've just started downloading directly from python.org.

26:29 I know, exactly. Exactly. So I'm trying it today. And so far, it's going well. So I'm

26:35 really excited that on macOS, we can probably get the latest Python. Even if you got to say the

26:40 version, I just have an alias that re-aliases what Python means in my .zshrc file. And it'll just say,

26:47 you know, if you type Python, that means Python 3.8 for now. Anyway, I'm pretty...

26:51 Yeah, fingers crossed. So it looks like it's good. And that's nice. Hopefully,

26:55 it just keeps updating itself. I suspect it will, at least within the 3.8 branch.

26:58 All right. You ready to close this out with a joke?

27:00 Yeah.

27:01 Yeah. So I'm sure you've heard the type of joke, you know, a mathematician and a physicist walk into a

27:07 bar and, right, well, some weird thing about numbers and space ensues. So this one is kind of like

27:14 that one. It's about search engine optimization. So an SEO expert walks into a bar, bars, pub,

27:21 public house, Irish pub, tavern, bartender, beer, liquor, wine, alcohol, spirits, and so on.

27:27 It's bad, huh?

27:30 I like that. That's nice.

27:32 Yeah, it's so true. Like, you remember how blatant websites used to be like 10 years ago,

27:37 they would just have like a massive bunch of just random keywords at the bottom. Just, you know,

27:42 like it seems like...

27:43 Yeah. And sometimes they would be in white-on-white text.

27:45 Yes, exactly. White on white.

27:46 You can't see them, but then if you highlight it, it would be like a whole three paragraphs.

27:49 Here's where the SEO hacker went. I don't think that works so well anymore. But yeah,

27:55 it's a good joke nonetheless. And Vicky, it's been great to have you here. Thanks so much for

28:00 filling in for Brian and sharing the data science view of the world with us.

28:04 Thanks for having me.

28:05 You bet. Bye.

28:05 Bye.

28:06 Thank you for listening to Python Bytes. Follow the show on Twitter via at Python Bytes. That's

28:11 Python Bytes as in B-Y-T-E-S. And get the full show notes at pythonbytes.fm. If you have a news

28:17 item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for

28:22 sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for

28:27 listening and sharing this podcast with your friends and colleagues.

