WEBVTT

00:00:00.001 --> 00:00:04.400
Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to

00:00:04.400 --> 00:00:11.620
your earbuds. This is episode 167, recorded January 29th, 2020. I'm Michael Kennedy,

00:00:11.620 --> 00:00:16.720
and Brian Okken is away. We miss you, Brian, but I have a very special guest to join me,

00:00:16.720 --> 00:00:18.260
Vicki Boykis. Welcome, Vicki.

00:00:18.260 --> 00:00:19.160
Thanks for having me.

00:00:19.160 --> 00:00:23.820
Yeah, it's great to have you here. I'm excited to get your take on this week's Python news. I

00:00:23.820 --> 00:00:27.340
know you found some really interesting and controversial ones that we're going to jump

00:00:27.340 --> 00:00:31.900
into, and that will be great. Also great is Datadog. They're sponsoring this episode,

00:00:31.900 --> 00:00:37.180
so check them out at pythonbytes.fm/Datadog. I'll tell you more about them later. Vicki,

00:00:37.180 --> 00:00:44.140
let me kick us off with command line interface libraries, for lack of a better word. So back

00:00:44.140 --> 00:00:50.080
on episode 164, so three episodes ago, I talked about this thing called Typer. Have you heard of

00:00:50.080 --> 00:00:51.800
Typer? T-Y-P-E-R?

00:00:51.800 --> 00:00:56.220
I have not, but I've heard of Click, so I'm curious to see how this differs from that even.

00:00:56.220 --> 00:01:03.680
Yeah, yeah. So this is sort of a competitor to Click. Typer is super cool because what it does is it uses

00:01:03.680 --> 00:01:11.440
native Python concepts to build out your CLI rather than attributes where you describe everything. So,

00:01:11.440 --> 00:01:16.780
for example, you can have a function and you just say this function takes a name, colon,

00:01:16.780 --> 00:01:23.020
str to give it a type, or an int or whatever, and then Typer can automatically use the type

00:01:23.020 --> 00:01:28.840
and the names of the parameters and so on to generate your help and the inbound arguments

00:01:28.840 --> 00:01:32.700
and so on. So that's pretty cool, right? Yeah. Seems like a great excuse to start using type

00:01:32.700 --> 00:01:37.260
annotations if you haven't yet. Yeah, exactly. Very, very nice that it leverages type annotations,

00:01:37.260 --> 00:01:41.940
hence the name Typer, right? So our listeners are great. They always send in stuff that we haven't

00:01:41.940 --> 00:01:47.900
heard about or you're like, I can't believe you didn't talk about this other thing. So Marcello sent in a

00:01:47.900 --> 00:01:54.740
message and says, hey, you should talk about Clize, C-L-I-Z-E, which turns functions into command line

00:01:54.740 --> 00:02:04.160
interfaces. So Clize is really cool, and it works very similarly to Typer. So what you do is you

00:02:04.160 --> 00:02:10.360
create functions, you give them variables, you don't have to use the types in the sense that Typer does. But you have

00:02:10.360 --> 00:02:18.660
positional arguments, and you have keyword only arguments. And you know, Python has that syntax that very few people use, but it's cool.

00:02:18.660 --> 00:02:26.280
If you want to enforce it, where you can say here are some parameters to a function comma star comma, here are some more and the

00:02:26.280 --> 00:02:34.080
stuff after the star has to be addressed as a keyword argument, right? Yeah, so it leverages that kind of stuff. So you can say, like their

00:02:34.080 --> 00:02:40.160
example says, here's a hello_world function, and it takes a name, which has a default of None, and then star comma, no

00:02:40.160 --> 00:02:48.780
capitalize is False, so it gives it a default value. So all you've got to do to run it is basically import clize.run and call

00:02:48.780 --> 00:02:56.160
run on your function. And then what it does is it verifies those arguments about whether or not they're required. And then it'll

00:02:56.160 --> 00:03:06.800
convert the keyword arguments to, like, --this or --that. So --no-capitalize will pass True to no_capitalize. If you omit it, it'll

00:03:06.800 --> 00:03:13.560
pass, you know, whatever the default is, I guess, so false. So there's like positional ones where you don't say the name, but then also this cool way of

00:03:13.560 --> 00:03:21.040
adding on these --capitalize flags and so on. So it seems like a really cool and pretty extensive library for building command line

00:03:21.040 --> 00:03:26.100
interfaces. Yeah, so this seems like it'd be good if you have a lot of parameters that you have to pass in. I'm thinking,

00:03:26.100 --> 00:03:31.780
specifically of some of the work that you would do in the cloud, like in the AWS command line. Yeah, yeah. Or similar?

00:03:31.780 --> 00:03:37.520
Yeah, for sure. Another thing that's cool is it will take your docstrings and use those as help messages.

00:03:37.520 --> 00:03:44.140
Oh, that's neat. Yeah, so you know, in some editors, you can type triple quote, enter, and it'll generate, you know, here's the

00:03:44.140 --> 00:03:50.480
summary of the method, and then here's the arguments, and you can put this, or you can just write them out, of course. And then here's the

00:03:50.480 --> 00:03:58.500
descriptions about each parameter, those become help messages about each command in there. So it's really

00:03:58.500 --> 00:04:04.640
nice. And I like how it uses just pure Python, sort of similar to Typer in that regard, that you don't put

00:04:04.640 --> 00:04:09.700
like three or four levels of decorators on top of things and then reference other parts of that. You just say,

00:04:09.880 --> 00:04:14.980
here's some Python code, I want to treat it as a command line interface, clize.run.
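
NOTE
Editor's sketch of the pattern described above, assuming a function shaped like the hello-world example the hosts read out. Only plain Python is executed here; the Clize call is left as a comment because it assumes Clize is installed. The docstring wording is illustrative.
```python
def hello_world(name=None, *, no_capitalize=False):
    """Greet the world or the given name.
    :param name: If specified, only greet this person.
    :param no_capitalize: Don't capitalize the given name.
    """
    if name:
        if not no_capitalize:
            name = name.title()
        return "Hello {0}!".format(name)
    return "Hello world!"
# Everything after the bare * is keyword-only, so passing it positionally fails.
try:
    hello_world("vicki", True)
except TypeError as exc:
    print("rejected:", exc)
print(hello_world())                             # Hello world!
print(hello_world("vicki"))                      # Hello Vicki!
print(hello_world("vicki", no_capitalize=True))  # Hello vicki!
# Assumption: with Clize installed, the same function becomes a CLI via
#     from clize import run
#     run(hello_world)
# where the keyword-only parameter is exposed as a --no-capitalize flag.
```
The bare star is what separates the positional argument from the flag-style option, which is the distinction the hosts describe.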

00:04:14.980 --> 00:04:16.000
That is pretty cool. Yeah.

00:04:16.080 --> 00:04:19.940
So there's now a lot of choices if you want to do command line interfaces.

00:04:19.940 --> 00:04:26.040
Yeah, yeah, definitely. And Click is good and very popular, and argparse as well. But I'm kind of a

00:04:26.040 --> 00:04:30.960
fan of these pure Python ones that don't require me to go do a whole bunch of extra stuff. So yeah,

00:04:30.960 --> 00:04:35.900
definitely loving that. You know what, I bet that Kaggle's not loving what you're talking about next.

00:04:35.900 --> 00:04:37.540
Before we get into this.

00:04:37.540 --> 00:04:39.380
Well, I think they might be, but...

00:04:39.380 --> 00:04:43.760
Yeah, we'll see. Okay. Tell us about Kaggle and what the big news here is.

00:04:43.760 --> 00:04:48.860
Yeah. So there was a dust up at Kaggle a couple weeks ago. So just as a little bit of background,

00:04:48.860 --> 00:04:54.340
Kaggle is a platform that's now owned by Google that allows data scientists to find data sets,

00:04:54.340 --> 00:04:59.140
to learn data science. And most importantly, it's probably known for letting people participate

00:04:59.140 --> 00:05:03.900
in machine learning competitions. That's kind of how it gained its popularity and notoriety.

00:05:03.900 --> 00:05:04.760
Yeah, that's how I know it.

00:05:04.760 --> 00:05:09.780
Yep. And so people can sharpen their data science and modeling skills on it. So they recently,

00:05:09.780 --> 00:05:16.820
I want to say last fall, hosted a competition that was about analyzing pet shelter data. And this

00:05:16.820 --> 00:05:23.640
resulted in enormous controversy. So what happened is there's this website that's called petfinder.my

00:05:23.640 --> 00:05:30.440
that helps people find pets to rescue in Malaysia from shelters. And in 2019, they announced a

00:05:30.440 --> 00:05:36.380
collaboration with Kaggle to create a machine learning algorithm to predict which pets would be most

00:05:36.380 --> 00:05:42.060
likely to be adopted based on the metadata descriptions on the site. So if you go to petfinder.my,

00:05:42.060 --> 00:05:47.260
you'd see that they'll have a picture of the pet and then a description, how old they are,

00:05:47.260 --> 00:05:49.220
and some other attributes about them.

00:05:49.220 --> 00:05:54.880
Right. Were they vaccinated or things like that, right? Sort of, you might think, well,

00:05:54.880 --> 00:05:59.040
if they're vaccinated or they're neutered or spayed, they may be more likely to be adopted,

00:05:59.160 --> 00:06:03.060
but you don't necessarily know, right? So that was kind of some, like, what are the important factors

00:06:03.060 --> 00:06:04.360
was this whole competition, right?

00:06:04.360 --> 00:06:10.400
Yeah. The goal was to help the shelters write better descriptions so that pets would be adopted

00:06:10.400 --> 00:06:16.560
more quickly. So after several months, so they held the competition for several months and there was a

00:06:16.560 --> 00:06:21.920
contestant that won and he was previously what was called a Kaggle Grandmaster. So he'd won a lot of

00:06:21.920 --> 00:06:27.640
different stuff on Kaggle before and he won $10,000 in prize money. But then what happened is they

00:06:27.640 --> 00:06:33.420
started to validate all of his data. Because when you do a Kaggle competition, you then submit

00:06:33.420 --> 00:06:37.940
all of your data and all of your results and your notebooks and your code.

00:06:37.940 --> 00:06:40.400
Like how you trained your models and stuff like that, right?

00:06:40.400 --> 00:06:40.800
Yeah.

00:06:40.800 --> 00:06:41.000
Okay.

00:06:41.000 --> 00:06:46.100
Yeah. All of that stuff. And then what happened was PetFinder wanted to put this model into

00:06:46.100 --> 00:06:51.480
production. So you initially have something like a Jupyter or a Colab notebook in this case.

00:06:51.480 --> 00:06:56.380
And the idea is that now you want to be able to integrate it into the PetFinder website.

00:06:56.380 --> 00:07:03.740
So they can actually use these predictors to fine tune how they post the content. And so when a

00:07:03.740 --> 00:07:09.840
volunteer, Benjamin Minixhofer, offered to put the algorithm into production and started

00:07:09.840 --> 00:07:14.560
looking at it, he found that there was a huge discrepancy between the first and second place

00:07:14.560 --> 00:07:20.280
entrants in the contest. And so what happened was, to get a little more into the technical

00:07:20.280 --> 00:07:24.840
aspect, the data they gave to the contestants asked them to predict the speed at which a pet

00:07:24.840 --> 00:07:28.480
would be adopted from one to five and included some of the features you talked about, like animal

00:07:28.480 --> 00:07:35.580
breed, coloration, all that stuff. The initial training set had 15,000 animals. And then after a

00:07:35.580 --> 00:07:41.600
couple of months, the contestants were given 4,000 animals that had not been seen before as a test of

00:07:41.600 --> 00:07:48.600
how accurate they were. So what the winner did was he actually scraped basically most of the website so

00:07:48.600 --> 00:07:56.660
that he got that 4,000 set, the validation set also. And he had the validation set in his notebook.

00:07:56.660 --> 00:08:05.180
So basically what he did was he used the MD5 library to create a hash for each unique pet. And then he looked

00:08:05.180 --> 00:08:10.860
up the adoption score for each of those pets, basically when they were adopted from that external

00:08:10.860 --> 00:08:18.120
data set. And there were about 3,500 that had overlaps with the validation set. And then he did a column

00:08:18.120 --> 00:08:23.900
manipulation in pandas to get at the hidden prediction variable for every 10th pet, not every single pet,

00:08:23.900 --> 00:08:25.900
but every 10th pet. So it didn't look too obvious.

00:08:25.900 --> 00:08:30.820
Right. So he gave himself like a 10% head start or advantage or something like that.

00:08:30.820 --> 00:08:36.140
Exactly. And he replaced the prediction that should have been generated by the algorithm with

00:08:36.140 --> 00:08:42.840
the actual value. And then he did a dictionary lookup between the initial MD5 hash and the value

00:08:42.840 --> 00:08:48.320
of the hash. And this was all obfuscated in a separate function that happened in his data.
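
NOTE
Editor's sketch of the kind of leak described above, shown so that reviewers know what to look for. It assumes made-up pet records, a hypothetical pet_key helper, and a hypothetical apply_leak function. It is not the contestant's actual code.
```python
import hashlib
def pet_key(attrs):
    # Hash a canonical string of the pet's attributes (MD5, as in the episode).
    canonical = "|".join(str(attrs[k]) for k in sorted(attrs))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
# Hypothetical externally scraped records: (attributes, true adoption speed).
scraped = [({"breed": "tabby", "age": 3}, 2), ({"breed": "beagle", "age": 12}, 4)]
leak = {pet_key(attrs): speed for attrs, speed in scraped}
# Hypothetical hidden test rows and the model's honest predictions for them.
test_rows = [{"breed": "tabby", "age": 3}, {"breed": "poodle", "age": 7}]
model_predictions = [3, 1]
def apply_leak(rows, predictions, every=10):
    # Overwrite only every Nth prediction when the hash matches a scraped pet.
    out = list(predictions)
    for i, row in enumerate(rows):
        key = pet_key(row)
        if i % every == 0 and key in leak:
            out[i] = leak[key]
    return out
print(apply_leak(test_rows, model_predictions))  # [2, 1]
```
A lookup keyed by a hash of the input features that returns a label, rather than a model output, is the red flag a code reviewer would want to catch.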

00:08:48.380 --> 00:08:53.740
Wow. And so they must've been looking at this going, what does the MD5 hash of the pet attributes

00:08:53.740 --> 00:08:58.520
have to do with anything? You know what I mean? Right. The hashes are meant to obscure

00:08:58.520 --> 00:08:59.240
stuff, right?

00:08:59.240 --> 00:09:00.040
Right. Yeah.

00:09:00.040 --> 00:09:01.340
So what was the fallout?

00:09:01.340 --> 00:09:08.340
So the fallout was this guy worked at h2o.ai. And so he was fired from there and Kaggle also issued

00:09:08.340 --> 00:09:14.620
an apology where they explained exactly what happened. And they expressed the hope that this didn't mean

00:09:14.620 --> 00:09:20.300
that every contest going forward would be viewed with suspicion, and the hope for more openness and collaboration

00:09:20.300 --> 00:09:24.800
going forward. Wow. And it was an amazing catch. Yeah. That's such a good catch. I'm so,

00:09:24.800 --> 00:09:31.120
so glad that Benjamin did that. I've got the whole deal here. Now did Kaggle actually end up paying

00:09:31.120 --> 00:09:36.340
him the 10,000 before they caught it? Is there like some sort of waiting period? Unfortunately,

00:09:36.340 --> 00:09:42.000
I think the money had already been disbursed by that point. Yeah. I can easily see something. Well,

00:09:42.320 --> 00:09:49.620
you know, like the prize money will be sent out after a, you know, very deep, it may change the

00:09:49.620 --> 00:09:54.220
timing of that for sure in the future, who knows, but wow, that's crazy. Do you know why he was fired?

00:09:54.220 --> 00:10:01.640
I mean, they're just like, we don't want you to say, I mean, h2o.ai, they're kind of a,

00:10:01.640 --> 00:10:07.480
we'll help you with your AI story. So I guess, you know, they're probably just like, we don't want

00:10:07.480 --> 00:10:13.240
any of the negativity of that on our product. Yeah, I think that's essentially it. And it was

00:10:13.240 --> 00:10:17.820
a pretty big competition in the data science community. And I think also once they'd started

00:10:17.820 --> 00:10:24.000
to look into it, in other places previously he had talked about just basically scraping data to game

00:10:24.000 --> 00:10:29.180
competitions as well. So all of that stuff started to come out as well. I think they wanted to distance

00:10:29.180 --> 00:10:33.900
themselves. Yeah, I can imagine. Yeah. Okay. Well, thank you for sharing that. Now, before we get to the

00:10:33.900 --> 00:10:38.480
next one, let me tell you about this week's sponsor, Datadog. They're a cloud scale monitoring

00:10:38.480 --> 00:10:44.480
platform that unifies metrics, logs and traces. Monitor your Python applications in real time,

00:10:44.480 --> 00:10:49.140
find bottlenecks with detailed flame graphs, trace requests as they travel across service boundaries,

00:10:49.140 --> 00:10:56.280
and their tracing client auto-instruments popular frameworks like Django, asyncio, and Flask,

00:10:56.280 --> 00:10:59.600
so you can quickly get started monitoring the health and performance of your Python apps.

00:11:00.160 --> 00:11:06.060
Do that with a 14 day free trial and Datadog will send you a complimentary t-shirt, cool little

00:11:06.060 --> 00:11:12.260
Datadog t-shirt. So check them out at pythonbytes.fm/Datadog. This next one kind of hits home for me

00:11:12.260 --> 00:11:17.440
because I have a ton of services and a lot of servers and websites and all these things working

00:11:17.440 --> 00:11:25.460
together, running on uWSGI. And I've had it running for quite a few years. It's got a lot of

00:11:25.460 --> 00:11:31.700
traffic, you know, we do like, I don't know, 14 terabytes of traffic a month or maybe even more

00:11:31.700 --> 00:11:37.120
than that. So quite a bit of traffic going around these services and whatnot. So it's been working

00:11:37.120 --> 00:11:43.160
fine. But I ran across this article by the engineers at Bloomberg. So they talked about this thing called

00:11:43.160 --> 00:11:50.180
Configuring uWSGI for Production Deployment. And I actually learned a lot from this article. So

00:11:50.180 --> 00:11:53.560
I don't feel like I was doing too many things wrong, but there were a couple of things where I'm like,

00:11:53.560 --> 00:11:58.880
oh yeah, I should probably do that. And other stuff just that is really nice. So I just want to run you

00:11:58.880 --> 00:12:03.800
through a couple of things that I learned. And if you want to hear more about how we're using uWSGI,

00:12:03.800 --> 00:12:10.800
you can check that out on Talk Python 215. Dan Bader and I swap stories about how we're running our

00:12:10.800 --> 00:12:16.240
thing, our various things, you know, Talk Python training and realpython.com and whatnot.

00:12:16.300 --> 00:12:22.320
So this is guidance from Bloomberg Engineering's Structured Products Applications group.

00:12:22.320 --> 00:12:27.880
That's quite the title. And they decided to use uWSGI because it's really, you know,

00:12:27.880 --> 00:12:34.680
good for performance, easy to work with. However, they said that as uWSGI has matured,

00:12:34.680 --> 00:12:40.300
some of the defaults that made sense when it was new, like in 2008, don't make sense anymore.

00:12:40.520 --> 00:12:46.980
The reason is partly just because the way people use these sites is different or these servers is

00:12:46.980 --> 00:12:53.200
different. For example, putting proxies up in front of uWSGI with, say, Nginx, that used to not be

00:12:53.200 --> 00:12:58.880
so popular. So they made these defaults built into the system that maybe don't make sense anymore.

00:12:58.880 --> 00:13:02.800
And so what they did is they said, we're going to go through and talk

00:13:02.800 --> 00:13:07.820
about all the things that we're going to override the defaults for, and why. Unbit, the developer of

00:13:07.820 --> 00:13:14.380
uWSGI, is going to fix all of these bad defaults in the 2.1 release. But right now it's 2.0

00:13:14.380 --> 00:13:20.360
as of this recording. So you're going to have to just, you know, hang in there or apply some of these

00:13:20.360 --> 00:13:25.100
changes. Now, I do want to point out one thing. When I switched on a lot of these, I did them one at a

00:13:25.100 --> 00:13:31.020
time. And the way you get it to reload its config is you say, relaunch the process, restart the process

00:13:31.020 --> 00:13:37.580
with systemctl, the daemon management tool on Linux. And one of their recommendations

00:13:37.580 --> 00:13:44.460
is to use this flag, die-on-term, which is for it to die on a different signal that it receives.

00:13:44.460 --> 00:13:49.720
And for whatever reason, maybe I'm doing it wrong, but whenever I turned that on, it would just lock up

00:13:49.720 --> 00:13:54.400
and it would take about two minutes to restart the server because it would just hang until it

00:13:54.400 --> 00:13:58.100
eventually timed out. It was like forcefully killed. So that seems bad. So I'm not using that.

00:13:58.200 --> 00:14:03.080
But I'll go quickly over the settings that I use that I thought were cool here. So there's,

00:14:03.080 --> 00:14:07.340
you've got these complicated config files. If you want to make sure everything's validated,

00:14:07.340 --> 00:14:11.820
you can say strict equals true. That's cool. That will verify that everything that's typed in the

00:14:11.820 --> 00:14:17.300
file is accurate and is valid, because it's kind of forgiving at the moment. Master equals true is a good

00:14:17.300 --> 00:14:22.740
setting because this allows it to create worker processes and recycle them based on number of requests

00:14:22.740 --> 00:14:28.020
and so on. Something that's interesting, I didn't even realize you could do. Maybe tell me if you knew

00:14:28.020 --> 00:14:33.400
this was possible in Python apps. You can disable the GIL, the global interpreter lock. You can say,

00:14:33.400 --> 00:14:36.160
you know what, for this Python interpreter, let's not have a GIL.

00:14:36.160 --> 00:14:37.300
Wow. How does that work?

00:14:37.300 --> 00:14:41.200
Yeah. Well, it's, I mean, people talk about having no GIL. It's like, oh, you can do all this cool

00:14:41.200 --> 00:14:46.000
concurrency and whatnot. But what it really means is you're basically guaranteeing you can only have

00:14:46.000 --> 00:14:52.040
one thread. So if you try to launch, say, a background job on a uWSGI server and you don't

00:14:52.040 --> 00:14:57.260
pass enable-threads equals true, it's just going to not run, because there's no GIL and there's no way to start it.

00:14:57.840 --> 00:15:03.000
That's something you want to have on. Vacuum equals true. This one I had off and I turned it on.

00:15:03.000 --> 00:15:07.980
Apparently this cleans up, like, temporary files and so on. Also single-interpreter. It used to be that

00:15:07.980 --> 00:15:12.960
uWSGI was more of an app server that might have different versions of Python and maybe Ruby as well.

00:15:12.960 --> 00:15:17.760
And this will just say, no, no, it's just the one version. A couple other ones. You can specify the

00:15:17.760 --> 00:15:24.040
name that shows up in like top or glances. So it'll say like, you can give it, say your website name,

00:15:24.040 --> 00:15:30.440
and it'll say things like that thing worker process one or that thing master process or whatnot. And so

00:15:30.440 --> 00:15:35.640
there's just a bunch of cool things in here with nice descriptions of why you want these features.

00:15:35.640 --> 00:15:39.840
So if you are out there and you're running uWSGI, give this a quick scan. It's really cool.
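
NOTE
Editor's sketch of the settings discussed above, rendered as a uWSGI ini file using Python's configparser. The option names are uWSGI's ini keys as I understand them. The values and the "mysite-" process-name prefix are illustrative assumptions, not the hosts' or Bloomberg's actual configuration. die-on-term is left out because the host reports trouble with it.
```python
import configparser
import io
# Illustrative values only; check each option against the uWSGI docs.
options = {
    "strict": "true",              # fail on unknown options in the file
    "master": "true",              # master process manages the workers
    "enable-threads": "true",      # allow threads started by the app
    "vacuum": "true",              # clean up temporary files on exit
    "single-interpreter": "true",  # one Python interpreter per worker
    "auto-procname": "true",       # readable names in top or glances
    "procname-prefix": "mysite-",  # hypothetical site name
}
config = configparser.ConfigParser()
config["uwsgi"] = options
buffer = io.StringIO()
config.write(buffer)
ini_text = buffer.getvalue()
print(ini_text)
```
Following the host's advice, it is probably safest to switch these on one at a time and restart the service after each change.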

00:15:40.200 --> 00:15:46.940
Now, this next one also is pretty neat. So this one comes from the people who did spaCy, right? What do

00:15:46.940 --> 00:15:47.560
they got going on?

00:15:47.560 --> 00:15:54.040
Yep, that's right. So this was just released a couple of days ago and it's called Thinc, and they bill it as

00:15:54.040 --> 00:15:59.480
a functional take on deep learning. And so basically, if you're familiar with deep

00:15:59.480 --> 00:16:05.320
learning, there's kind of two big competing frameworks right now. TensorFlow and PyTorch and

00:16:05.320 --> 00:16:11.160
MXNet is also in there. So the idea of this library is that it abstracts away some of the boilerplate

00:16:11.160 --> 00:16:15.960
that you have to write for both TensorFlow and PyTorch. PyTorch has a little bit less. TensorFlow

00:16:15.960 --> 00:16:22.520
with Keras on top also has a little bit less, but you end up writing a lot of the same kind of stuff.

00:16:22.520 --> 00:16:28.520
And there's also some stuff that's obfuscated away from you, specifically some of the matrix

00:16:28.520 --> 00:16:36.280
operations that go on under the hood. And so what Thinc does is, well, it already powers spaCy, which is

00:16:36.280 --> 00:16:42.200
an NLP library under the covers. So what the team did was they surfaced it so that other people could

00:16:42.200 --> 00:16:49.080
use it more generically in their projects. And so it has that favorite thing that we love. It has type

00:16:49.080 --> 00:16:55.320
checking, which is particularly helpful for tensors when you're trying to get stuff and you're not sure

00:16:55.320 --> 00:17:01.880
why it's not returning things. It has classes for PyTorch wrappers and for TensorFlow. And you can

00:17:01.880 --> 00:17:07.560
intermingle the two if you want to, if you have two libraries that bridge things. It has deep support

00:17:07.560 --> 00:17:12.920
for NumPy structures, which are the kind of the underlying structures for deep learning.

00:17:12.920 --> 00:17:19.400
It operates in batches, which is also a common feature of deep learning projects. So they process

00:17:19.400 --> 00:17:25.160
features and data in batches. And then also, sometimes a problem that you have with deep

00:17:25.160 --> 00:17:31.400
learning is you're constantly tuning hyperparameters or the variables that you put into your model to

00:17:31.400 --> 00:17:36.600
figure out how long you're going to run it for, how many training epochs you're going to have, what size

00:17:36.600 --> 00:17:41.880
your images are going to be. Usually those are clustered at the beginning of your file as kind of

00:17:41.880 --> 00:17:47.720
like a dump or a dictionary or whatever. It has a special structure to handle those as well. So it basically

00:17:47.720 --> 00:17:53.000
hopes to make it easier and more flexible to do deep learning, especially if you're working with

00:17:53.000 --> 00:17:58.440
two different libraries and it offers a nice higher level abstraction on top of that. And the other

00:17:58.440 --> 00:18:04.280
cool thing is they have already released all the examples and code, which are available in Jupyter

00:18:04.280 --> 00:18:08.120
notebooks on their GitHub repo. So I'm definitely going to be taking a closer look at that.

00:18:08.120 --> 00:18:12.440
Yeah, that's really cool. They have all these nice examples there and even buttons to open them in

00:18:12.440 --> 00:18:19.240
Colab, which is, yeah, that's pretty awesome. This looks great. And it looks like it's doing some work

00:18:19.240 --> 00:18:25.640
with FastAPI as well. I know they hired a person who's maintaining FastAPI, which is cool. Also,

00:18:25.640 --> 00:18:31.400
their Prodigy project. So yeah, this looks like a really nice library that they put together. Cool. And

00:18:31.400 --> 00:18:37.560
Ines has been on the show before. Ines from Explosion AI here as a guest co-host as well. Super

00:18:37.560 --> 00:18:41.640
cool. That's awesome. Yeah. This next one I want to talk about, you know, I'd love to get your opinion

00:18:41.640 --> 00:18:45.000
because you're more on the data science side of things, right? Yeah. Yeah. So this next one, I want

00:18:45.000 --> 00:18:51.160
to tell folks about, this is another one from listeners. You know, we talked about something that

00:18:51.160 --> 00:18:55.320
validates pandas, and they're like, oh, you should also check out this thing. So this comes from

00:18:55.320 --> 00:19:02.280
Jacob Deppen. Thank you, Jacob, for sending this in. And so it's pandas-vet. And what it is,

00:19:02.280 --> 00:19:09.560
is a plugin for Flake8 that checks pandas code. And it's this opinionated take on how you should use

00:19:09.560 --> 00:19:16.120
pandas. They say one of the challenges is that if you go and search on Stack Overflow or other

00:19:16.120 --> 00:19:20.520
tutorials, or even maybe video courses, they might show you how to do something with pandas,

00:19:20.520 --> 00:19:26.200
but maybe that's a deprecated way of working with pandas or some sort of old API. And

00:19:26.200 --> 00:19:32.200
there's a better way. So the idea is to make pandas more friendly for newcomers by trying

00:19:32.200 --> 00:19:36.760
to focus on best practices and saying, don't do it that way. Do it this way. You know, read_csv. It

00:19:36.760 --> 00:19:43.720
has so many parameters. What are you doing? Here's how you use it. Things like that. So this is based on a

00:19:43.720 --> 00:19:50.360
talk, or rather, the idea for this linter was sparked by a talk by Ania Kapuścińska.

00:19:50.360 --> 00:19:55.160
Sorry, I'm sure I blew that name bad, but at PyCascades 2019 in Seattle,

00:19:55.160 --> 00:19:59.160
"Lint Your Code Responsibly." So I'll link to that as well. So it's kind of cool to see the evolution.

00:19:59.160 --> 00:20:05.640
Like Ania gave a talk at PyCascades and then this person's like, oh, this is awesome. I'm going to

00:20:05.640 --> 00:20:09.880
actually turn this into a Flake8 plugin and so on. What are your thoughts on this? Do you like this idea?

00:20:10.040 --> 00:20:13.880
Yeah, I'm a huge fan of it. I think in general, there's been kind of like this,

00:20:13.880 --> 00:20:18.520
I wouldn't want to say culture war about whether notebooks are good or bad. And there was recently

00:20:18.520 --> 00:20:24.440
a paper released, I want to say, not a paper, but a blog post a couple of days ago about how you should

00:20:24.440 --> 00:20:31.720
never use notebooks. There was a talk by Joel Grus last year about all the things that notebooks are

00:20:31.720 --> 00:20:36.200
bad at. I think they have their place. And I think this is one of the ways you can have,

00:20:36.200 --> 00:20:42.120
I want to say, guardrails around them and help people do things. I like the very opinionated

00:20:42.120 --> 00:20:47.080
warning that they have here, which is that df is a bad variable name. Be kinder to yourself,

00:20:47.080 --> 00:20:51.160
because that's always true. You always start with the default of df and then you end up with

00:20:51.160 --> 00:20:56.360
34, 35 of them. I joke about this on Twitter all the time, but it's true. So that's a good one.

00:20:56.360 --> 00:21:03.160
The .ix versus .loc and .iloc one is always a point of confusion. So it's good that they have that.

00:21:03.160 --> 00:21:08.920
And then the one where pivot_table is preferred to pivot or unstack. So there's a lot of places. So pandas is

00:21:08.920 --> 00:21:14.520
fantastic, but there's a lot of these places where you have old APIs, you have new APIs, you have people

00:21:14.520 --> 00:21:20.520
who usually are both new to Python and programming at the same time coming in and using these. So this

00:21:20.520 --> 00:21:24.600
is a good set of guardrails to help write better code if you're writing it in a notebook.

00:21:24.600 --> 00:21:28.600
Oh yeah, that's super cool. Do you know, is there a way to make Flake8 run in the notebook

00:21:28.600 --> 00:21:32.440
automatically? I don't know. You probably can, yeah. It probably wouldn't be too hard.

00:21:32.440 --> 00:21:36.760
Yeah, but I don't know. Yeah, but I think it's interesting that you ask that because that's not

00:21:36.760 --> 00:21:41.800
generally something you would do with notebooks. But maybe this kind of stuff will push it in the

00:21:41.800 --> 00:21:48.040
direction of being more like what we consider quote unquote mainstream or just web dev or

00:21:48.040 --> 00:21:52.280
backend programming. Yeah, cool. Well, I definitely think it's nice if I were getting started with

00:21:52.280 --> 00:21:57.000
Pandas, give this a check. You also, if you're getting started with Pandas, you may also be getting

00:21:57.000 --> 00:22:03.000
started with NumPy, right? Yep. So NumPy is the backbone of numerical computing in Python.

00:22:03.480 --> 00:22:08.680
So I talked about TensorFlow, PyTorch, machine learning in the previous stories. All of that

00:22:08.680 --> 00:22:15.000
kind of rests on the work and the data structures that NumPy created. So pandas, scikit-learn,

00:22:15.000 --> 00:22:19.720
TensorFlow, PyTorch, they all lean heavily, if not directly depend on the core concepts, which include

00:22:19.720 --> 00:22:27.080
matrix operations through the NumPy array, also known as an ndarray. The problem with ndarrays is

00:22:27.080 --> 00:22:32.440
they're fantastic, but the documentation was a little bit hard for newcomers. So Anne Bonner wrote

00:22:32.440 --> 00:22:38.840
a whole new set of documentation for people that are both new to Python and scientific programming,

00:22:38.840 --> 00:22:45.160
and that's included in the NumPy docs themselves. Before, if you wanted to find out what arrays were,

00:22:45.160 --> 00:22:50.280
how they worked, you could go to the section and you could find out the parameters and attributes and

00:22:50.280 --> 00:22:55.320
all the methods of that class. But you wouldn't find out how or why you would use it. And so this

00:22:55.320 --> 00:23:00.360
documentation is fantastic because it has an explanation of what they are. It has visuals of

00:23:00.360 --> 00:23:05.880
what happens when you perform certain operations on arrays. And it has a lot of really great resources if

00:23:05.880 --> 00:23:10.520
you're just getting started with NumPy. My strong recommendation, if you're doing any type of data

00:23:10.520 --> 00:23:15.400
work in Python, especially with Pandas, is that you become familiar with NumPy arrays. And this makes it

00:23:15.400 --> 00:23:21.880
really easy to do so. Yeah, nice. It has things like, how do I convert a 1D array to a 2D array?

00:23:21.880 --> 00:23:28.120
Or what's the difference between a Python list and a NumPy array and whatnot? Yeah, it looks really

00:23:28.120 --> 00:23:34.120
helpful. I like the why. That's often missing. You'll see like, you do this, use this function for

00:23:34.120 --> 00:23:40.280
this and here are the parameters. Sometimes they'll describe them, sometimes not. And then it's just like,

00:23:40.280 --> 00:23:44.440
well, maybe this is what I want. Stack Overflow seemed to indicate this is what I want. I'm not

00:23:44.440 --> 00:23:47.800
sure. I'll give it a try. Right. So I like the little extra guidance behind it. That's great.
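
NOTE
Editor's sketch, not spoken in the episode: the two beginner questions mentioned above (a Python list versus a NumPy array, and converting a 1D array to 2D) might look roughly like this. The variable names and sample values are illustrative assumptions, not taken from the NumPy docs being discussed.
```python
import numpy as np
# A plain Python list: multiplying by 2 repeats the list.
values = [1, 2, 3, 4, 5, 6]
repeated = values * 2
# A NumPy array (ndarray): multiplying by 2 is element-wise arithmetic.
arr = np.array(values)
doubled = arr * 2
# Convert the 1D array to 2D by reshaping it to 2 rows by 3 columns.
grid = arr.reshape(2, 3)
# Alternatively, add a new axis to get a single-row 2D array.
row = arr[np.newaxis, :]
print(repeated)
print(doubled)
print(grid.shape, row.shape)
```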

00:23:47.800 --> 00:23:52.840
Yeah, it does a really good job of orienting you. Cool. All right. Well, Vicki, those are our main

00:23:52.840 --> 00:23:58.760
topics for the week. But we got a few extra quick items just to throw in here at the end. I'll let you

00:23:58.760 --> 00:24:04.760
go first with yours. Sure. This is just a bit of blatant self-promotion about who I am. So I am a

00:24:04.760 --> 00:24:10.760
data scientist. On the side, I write a newsletter that's called Normcore Tech. And it's about all the things that I'm

00:24:10.760 --> 00:24:15.960
not seeing covered in the mainstream media. And it's just a random hodgepodge of stuff. It ranges from

00:24:15.960 --> 00:24:23.000
anything like machine learning, how the data sets got created initially for NLP. I've written about Elon Musk

00:24:23.000 --> 00:24:30.200
memes. I wrote about the recent raid of the Nginx office in great detail and what happened there. So there's a free

00:24:30.200 --> 00:24:36.760
version that goes out once a week and paid subscribers get access to one more paid newsletter per week. But really, it's more

00:24:36.760 --> 00:24:41.560
about the idea of supporting in-depth writing. So it's just vicki.substack.com.

00:24:41.560 --> 00:24:48.600
Cool. Well, that's a neat newsletter. And I'm a subscriber. So very, very nice. I have a quick one for you all

00:24:48.600 --> 00:24:57.240
out there. And maybe two, actually. One, pip 20.0 was released. So not a huge change. Obviously,

00:24:57.960 --> 00:25:04.520
this is compatible with the stuff that I did before and whatnot. But it does a couple of nice things. And I

00:25:04.520 --> 00:25:10.360
think this is going to be extra nice for beginners because it's so challenging. You go to a tutorial

00:25:10.360 --> 00:25:15.560
and it says, all right, the first thing you got to do to run whatever, I want to run Flask or I want to

00:25:15.560 --> 00:25:21.640
run Jupyter is you say pip install Flask or pip install Jupyter. And it says you do not have permission to,

00:25:21.960 --> 00:25:27.960
you know, write to wherever those are going to install, right? Depending on your system. And so

00:25:27.960 --> 00:25:35.480
if that happens now in pip 20, it will install as if --user was passed, into the user profile.

00:25:35.480 --> 00:25:36.360
That's cool, huh?

00:25:36.360 --> 00:25:37.240
That's really neat.
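
NOTE
Editor's sketch, not spoken in the episode: to see where a per-user install (the --user behavior described above) would put packages for the current interpreter, the standard library's site module can report the locations. The printed paths depend on your system, so no expected output is shown.
```python
import site
# Directory that "pip install --user" targets for this interpreter.
user_site = site.getusersitepackages()
# Base directory of the per-user install scheme.
user_base = site.getuserbase()
# True when the user site directory is enabled; False or None when it is disabled.
user_site_enabled = site.ENABLE_USER_SITE
print("user site-packages:", user_site)
print("user base:", user_base)
print("user site enabled:", user_site_enabled)
```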

00:25:37.260 --> 00:25:42.620
Yeah, yeah. So that's great. And it caches wheels built from Git requirements and a couple of

00:25:42.620 --> 00:25:48.380
other things. So yeah, nothing major, but nice to have that there. And then also, I'd previously

00:25:48.380 --> 00:25:54.700
gone on a bit of a rant saying I was bugged that Homebrew, which is how I put Python on my Mac,

00:25:54.700 --> 00:26:01.580
was great for installing Python 3 until 3.7. So if you just, it's even better because if you just say,

00:26:01.580 --> 00:26:07.100
brew install python, that means Python 3, not legacy Python, which is great.

00:26:07.100 --> 00:26:12.700
But that sort of stopped working. It still works, but it installs Python 3.7. So that was,

00:26:12.700 --> 00:26:18.380
that was kind of like a, oh, sad face. But I'm sorry, I forget the person who sent this over on

00:26:18.380 --> 00:26:24.040
Twitter. But a listener sent in a message that said you can brew install python@3.8. And that

00:26:24.040 --> 00:26:25.960
works. Why? That's not...

00:26:25.960 --> 00:26:29.720
Is it safe to brew again? I've just started downloading directly from Python.

00:26:29.880 --> 00:26:35.640
I know, exactly. Exactly. So I'm trying it today. And so far, it's going well. So I'm

00:26:35.640 --> 00:26:40.880
really excited that on macOS, we can probably get the latest Python. Even if you got to say the

00:26:40.880 --> 00:26:47.780
version, I just have an alias that re-aliases what python means in my .zshrc file. And it'll just say,

00:26:47.780 --> 00:26:51.940
you know, if you type python, that means Python 3.8 for now. Anyway, I'm pretty...

00:26:51.940 --> 00:26:55.100
Yeah, fingers crossed. So it looks like it's good. And that's nice. Hopefully,

00:26:55.100 --> 00:26:58.600
it just keeps updating itself. I suspect it will, at least within the 3.8 branch.

00:26:58.600 --> 00:27:00.860
All right. You ready to close this out with a joke?

00:27:00.860 --> 00:27:01.300
Yeah.

00:27:01.300 --> 00:27:07.460
Yeah. So I'm sure you've heard the type of joke, you know, a mathematician and a physicist walk into a

00:27:07.460 --> 00:27:14.240
bar and, right, well, some weird thing about numbers and space ensues. So this one is kind of like

00:27:14.240 --> 00:27:21.500
that one. It's about search engine optimization. So an SEO expert walks into a bar, bars, pub,

00:27:21.500 --> 00:27:27.160
public house, Irish pub, tavern, bartender, beer, liquor, wine, alcohol, spirits, and so on.

00:27:27.160 --> 00:27:30.000
It's bad, huh?

00:27:30.000 --> 00:27:32.020
I like that. That's nice.

00:27:32.020 --> 00:27:37.180
Yeah, it's so true. Like, you remember how blatant websites used to be like 10 years ago,

00:27:37.180 --> 00:27:42.440
they would just have like a massive bunch of just random keywords at the bottom. Just, you know,

00:27:42.440 --> 00:27:43.460
like it seems like...

00:27:43.460 --> 00:27:45.420
Yeah. And sometimes they would be in white-on-white text.

00:27:45.420 --> 00:27:46.720
Yes, exactly. White on white.

00:27:46.720 --> 00:27:49.800
You can't see them, but then if you highlight it, it would be like three whole paragraphs.

00:27:49.800 --> 00:27:55.120
Here's where the SEO hacker went. I don't think that works so well anymore. But yeah,

00:27:55.120 --> 00:28:00.140
it's a good joke nonetheless. And Vicki, it's been great to have you here. Thanks so much for

00:28:00.140 --> 00:28:04.420
filling in for Brian and sharing the data science view of the world with us.

00:28:04.420 --> 00:28:05.140
Thanks for having me.

00:28:05.140 --> 00:28:05.660
You bet. Bye.

00:28:05.660 --> 00:28:05.980
Bye.

00:28:06.460 --> 00:28:11.020
Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes. That's

00:28:11.020 --> 00:28:17.600
Python Bytes as in B-Y-T-E-S. And get the full show notes at pythonbytes.fm. If you have a news

00:28:17.600 --> 00:28:22.480
item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for

00:28:22.480 --> 00:28:27.520
sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for

00:28:27.520 --> 00:28:30.140
listening and sharing this podcast with your friends and colleagues.

