Brought to you by Michael and Brian - take a Talk Python course or get Brian's pytest book


« Return to show page

Transcript for Episode #152:
You have 35 million lines of Python 2, now what?

Recorded on Wednesday, Oct 9, 2019.

00:00 KENNEDY: Hello, and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is Episode 152, recorded October 9, 2019. I'm Michael Kennedy.

00:10 OKKEN: And I'm Brian Okken.

00:11 KENNEDY: And this episode is brought to you by Digital Ocean. Check them out at pythonbytes.fm/digitalocean, get $50 credit for new users. Now, we may have touched on this concept of Legacy Python before, Brian. Have we covered it?

00:23 OKKEN: Yeah, I think we have.

00:24 KENNEDY: We definitely have. So we know that there are companies out there that say, "It's really tricky for us to upgrade to Python 3 because..." And sometimes that's because, I don't know, just they don't put the resources into it, right? They would rather work on features rather than going back and rewriting old code to do the same thing, so it's not so old. Things like that. Other times it's because they have a ton of Python code. And we're hearing more and more stories of these companies that have been like head in the sand, waiting until the very, very last minute to make those migrations. And they're just like, "Alright, finally, somebody has raised it to the level that it has to be dealt with." Right? Well, it turns out that banks use a lot of Python code, as we know. And I've heard of Bank of America using a ton of code, and having a lot of people working on some Python projects. But JPMorgan, JPMorgan Chase, they use maybe even more. They use a ton of Python. So there's an article that's based on this presentation by Misha Tselman, who is the executive director at JPMorgan Chase, about this. It was given at PyData 2017. So they've been working on it. But the problem is they have 35 million lines of Python 2 code.

01:36 OKKEN: Oh, that's a lot.

01:36 KENNEDY: In terms of Python code, that's kind of ridiculous, right? That's an insane amount. So they've got a lot of Python code that has to be converted to Python 3, and this is from their Athena trading platform, which is at the core of their business operations, right? So they got a late start to migrating to Python 3 and people point out this could be a security risk for them, right? We saw what happened with Equifax. Some outdated things there. Who knows what the risks are. I think it's probably less than something like the web frameworks that were out of date other places. But yeah, they have a lot of stuff that has to be migrated. And internally they use Python for pricing, trading, risk management, analytics, and even machine learning. So just to look at some stats from this project. The feature set utilizes 150,000 Python modules, over 500 open source packages, 35 million lines of Python code contributed by 1,500 developers.

02:36 OKKEN: Okay, so they got a big team.

02:37 KENNEDY: That's a huge scale, and by the way, I wonder how much JPMorgan Chase is contributing back to those 500 open source projects.

02:44 OKKEN: Hopefully some.

02:45 KENNEDY: Alright, now, it says they're going to miss the deadline, Alright? That most of the strategic elements are going to be in place by Q1 2020, but they can't do it all, and it's probably a good road map for folks, right? They don't have to upgrade it all, and then release that new thing, right? They could upgrade elements at a time. There's a lot of great stories on how folks have done that. I think probably the Instagram project was the most awesome one I've seen where they didn't even branch, right? They just found a way to seamlessly move from Python 2 to 3, while centering in on 2, then finally flipping the switch. Here's another one I thought you would find interesting though. They have some other stats. On your projects, how often do you commit code? It's like once a week? Once a day? Once an hour?

03:29 OKKEN: Yeah, several times a day.

03:30 KENNEDY: Yeah. I'm kind of same. I do that. And you guys don't really release stuff, but say the Python Bytes website or the Talk Python training site, those probably do some form of website release every other day. Some sort of deploy, restart. Run through the whole deployemt process. So JPMorgan Chase uses continuous delivery with continuous integration, continuous delivery, with 10,000 to 15,000 production changes a week.

03:58 OKKEN: That's amazing.

03:59 KENNEDY: That's like mind blowing, isn't it? So, they're on it, I guess. But it's just such a project of massive scale that it's hard to get your mind around and hard to find analogies. So I'm sure there's a few other projects like this in the world, but can't be many.

04:13 OKKEN: No. Wow. That's like one a second. Or faster.

04:17 KENNEDY: It's constant redeploy. It's got to be microservices and other stuff, right? Otherwise, how would you go to the website? How would you use its services? Anyway. Quite incredible. Alright, well, whatcha got for our next one?

04:28 OKKEN: This is just kind of a cool little tool called Organize, and it was suggested from Ariel Barkan on Twitter. And I took a look at this and I'm going to start using it right away. It's a Python-based file management automation tool. And the idea is people are lazy with how they save files and download files and whatever. And on my Mac, for example, all the screenshots just show up on the desktop. And then occasionally I'll just take everything and lump them into a clutter folder or something. But this is a tool where you can give it rules in the yaml file and say, have it do things like move all your screenshots from the desktop into a screenshots folder. Or look through all your downloads to look at the incomplete downloads that you canceled or something, they're still sitting there. And just trash those if they're older than a few days old or something. Doing things like removing empty files from certain folders, like your download or desktop, or other places. One of the examples is to organize your receipts and invoices into date-based folders, which is pretty cool, because there's macros involved that you can look at the file, touch time, and figure out what date, and extrapolate the dates and stuff. And, yeah, I always, when I'm paying bills or something, I save the receipt to just wherever, in the downloads folder or something. And having this, just running this every once in a while could clean it up and put everything in its place. It's pretty cool.

05:57 KENNEDY: That's super cool. You could just put it on a cron job that runs every five minutes or every minute or something, right? It just goes, "Woop." It's got to be super quick. Just looks at the file, A few folders, and then does some text matching.

06:08 OKKEN: It's one of those automate the boring stuff sort of things that somebody thought, "You know, everybody has this problem, so." Yeah, it's nice.

06:14 KENNEDY: Yeah, I like it. I have the same problem with receipts and stuff. I'll get them in an email or as a PDF attachment or actually it's just an email that I'll print to PDF so that I can save it for taxes, and they just clutter up, yeah. It's like, I could totally see just using that. The rules seem like they're rich enough to do that, so. Yeah, looks really good.

06:33 OKKEN: Yeah, super cool.

06:34 KENNEDY: Alright, speaking of cool, let me just tell you about Digital Ocean. So all of our services run on Digital Ocean. Audio you're listening to now somehow flowed through the Digital Ocean servers to get to you. And they've got all sorts of great options out there that are simple, but powerful. There's not knobs to run, absolutely every little edge case, right? You set up the main server that you want to work with, you have spaces, you have hosted databases in MySQL and Postgres, and you even have caching like Redis, things like that. So super nice, check them out at pythonbytes.fm/digitalocean and get $50 credit for new users. Highly recommended. Now, this next one is a fun one, and it took me a minute to realize what this was about, Brian. So I realized there's this new PEP, PEP 589. And it allows you to define typed dictionaries. Like define a type that represents a dictionary. Well, it turns out, there was already a way to do that, which is why I was confused, because there's PEP 484, which has been around for a while, which lets you create a dict of k,v which is like here's a dictionary of arbitrary keys and it has, maybe, integers, or it has user objects, or whatever, right? So you can define these uniform dictionaries, which is kind of interesting, but this new PEP, it lets you go much farther. it's proposed by Jukka Lehtosalo? Lehtosalo. It's actually sponsored by Guido van Rossum. So remember recently we spoke about Guido and we had this philosophical debate of like, well, he's all about typing these days, but originally typing was explicitly left out of the language. What's the story? So here's another typing thing that he's participating in, which I think is interesting. This is accepted, it's scheduled for 3.8, so all sorts of interesting stuff, and it's coming down the line, right? Soon, actually. So what it lets you do is, imagine you have an arbitrary JSON document, or an arbitrary Python dictionary, really, but it's super easy to think of like, "Well, somebody sends me a JSON request and I'm going to treat it as if I know what's happening here." It lets you actually specify the shape of those things, both the keys, as well as the values. Potentially nested documents, right? So you might have a JSON object that's got some values and those values might be a list of other JSON documents. You can describe that with this type dict thing. So the way it works kind of caught me off guard at first, but I think I like it. So what you do, instead of just saying there's a dictionary of stream,user you actually create a class, which derives from TypedDict. Okay? And then it has fields. It looks a lot like data classes a little bit. So you might have a name:str, and a year:int. In this thing that is not actually the dictionary, but it is the type that validates the dictionary. Alright?

09:30 OKKEN: Oh, okay.

09:31 KENNEDY: And then you can say it is one of those, right? So I say... the example they give is there's some movies. So you say movie:Movies, name the class, and then it's just a dictionary. But the dictionary has the name, which is a string, value, and a year, which is an integer value, and so on. And then you can actually validate it, and the static-type checker, like mypy, so on, will, if you say movie of director, it'll say, "No, no, no. You can't set this value into this dictionary, because it doesn't have a key called director. Or if you try to set the year to the string, "1982" in quotes, it'll say, "No, no. This is a string. It expected an integer."

10:08 OKKEN: But the errors come at the type checking time, right?

10:11 KENNEDY: This is at type checking time. Although, it's totally reasonable that things like PyCharm and VS Code would add edit time checking for this, as well. Because they do for all the other type stuff.

10:20 OKKEN: Yeah, but it's not a runtime.

10:21 KENNEDY: It's not a runtime thing. Yeah, all the typing stuff. And this is definitely that way. So you're not re-implementing the dictionary. You're not creating a dictionary type that is different. You create a type which, then, talks about just like a plain dictionary. So quite interesting, actually.

10:39 OKKEN: Yeah, it does take a little while to look it and go, "Does this make sense?" But, yeah, it does.

10:43 KENNEDY: Right. Imagine you're getting, you're writing API. And somebody's submitting a JSON post to you. And you want to know is it valid? Right? You could use this basically to validate your schema. Or at least describe the schema you expect.

10:55 OKKEN: Yeah, neat.

10:56 KENNEDY: It is neat. Speaking of APIs and new web things, your next one is one of those, right?

11:02 OKKEN: I got carried down that rabbit hole. No, that's cool. The next one, I was just enticed by the name. So there's a package called Gazpacho. It's just great. It's fun to say, it's fun to eat. But anyway, Gazpacho is a web scraping library and the goal of it is to replace requests and Beautiful Soup for most web scraping projects. And I got to tell you, I was going to do... I have some web scraping projects that I wanted to do. And I know that requests and Beautiful Soup are easy to use and they're super powerful, but that one use-case where you're just doing a get, then you parse it, and then you find some stuff in it and separate it out. That's so common that this is basically optimizing for that. There's an example article that I'll link to, also, that uses Gazpacho to scrape hockey data for the use of fantasy sport use. But it's just a really simple interface. You import, from Gazpacho, you import get and Soup as a class, and you can use those to grab some HTML and parse it. Find some stuff in there. It's just a handful of lines of code and you've got a web scraper on your hands. I like it. I think I'll give it a shot. But I tried it out, and I wanted to bring this up, because I tried it out. And I ran into a problem that I was getting these certificate errors. Have you ever gotten certificate errors when you're trying to parse things, or pull things down?

12:30 KENNEDY: Yeah, just once or twice. And it's the kind of thing where you bounce off the walls of Stack Overflow until you get it fixed and then you forget how to fix them. So, yeah, what did you do?

12:40 OKKEN: I did the same thing. Went to Stack Overflow. And apparently, within the... and I don't know if this is just a Mac thing or not, but on Macs at least, when you install Python, in the install directory, in applications Python 3 whatever, there's a file called installcertificates.cmd. And you just have to run that. And then it has the list of certificates or something. I don't know how certificates work. But it makes it so that you can access SSL stuff from Python, so. Ran into that today.

13:13 KENNEDY: That's right. I'm glad you're linking to it, so now we'll have it forever. Yeah, that's cool. It's nice Gazpacho is like two to three times faster than Beautiful Soup, which is pretty sweet. I like that.

13:26 OKKEN: It also does a lot less, so that makes sense.

13:28 KENNEDY: Yeah, for sure.

13:28 OKKEN: It's a more focused thing.

13:29 KENNEDY: It's like the 80% case, though, right? You just need to go do simple things.

13:33 OKKEN: That's where I'm going to use it for.

13:34 KENNEDY: So the last thing I want to cover for our main items is pip. So remember, actually, we spoke about PyDist? P-Y-D-I-S-T? Yeah. This is like a private PyPI as a service, I guess, is kind of the way I would describe it. So right now, I think they... before we talked about this and like, "Well, just in beta doesn't seem to have any pricing or anything like that." So they have pricing and a little bit more details. They've more or less launched at this point. So this article's not about this, but it was written by folks who run that. That's the connection back to the previous thing. And it talks about how pip install works. So for this section, I just want to talk to you real quick about, when you say pip install certifi, like it did in that previous article you just mentioned, to fix your certificates, what do you do? How does it work? Alright, so it walks you through all the steps and all the decisions and whatnot that pip has to make when you say pip install some package. So the first thing is has to decide, well, first I guess, does the package exist, right? And then it needs to figure out which distribution of the package to install? Because we have eggs, we have wheels, we have source. We have all these different types of distributions. There are seven different kinds of distributions, but the most common are either source distributions or binary wheels, so focus on those, right? So source distribution is just, here's your Python code, and maybe the C code that comes with it, and as part of the setup we're going to run a compiler against the C code to make sure that that's compiled in your machine, right? Super easy to write, not so easy to make sure it works everywhere, not just works on my machine, right? Because you've got to have compilers on all the platforms and, oh yeah, what about that old version of Windows that was a minimal install and doesn't have GCC or Visual Studio, or whatever. So wheels are a little bit more safe, and also faster. But that means they have compiled C code, which has to be, you have to have multiple ones for different platforms, right? So Windows versus macOS or something. The benefit is stuff installs fast, right? So like NumPy takes about four minutes to compile from source. So if you did a source dist of NumPy, pip install might be slower than you would otherwise expect. Alright? So, anyway.

15:58 OKKEN: The four minute pip install, yes, that's slower than...

16:01 KENNEDY: Yeah, that's before you even hit the dependencies, right? That's just the primary thing. Yeah, okay, so it has to figure out which one those are. There's actually a known URL, so like pypi.org/simple/packagename is where you'd go. So you could go to that, /requests, for example. And there is a huge, just flat, it's like a weird API. Like HTML list of a bunch of wheels, of platform names, and tarballs, and all sorts of stuff. So it starts out by going there to figure out, well, what is here? What can I find? So first it determines what system you're on and what's compatible with things. So if you have a binary wheel, there's actually a PEP that talks about how you figure out which one that is. And if it's a source dist, well, you just assume it works. So once it has that then it'll try to get the best, and it prefers wheels. And then it has to figure out the dependencies. So for binary wheels there's a file called metadata that has a list of those. So that's cool, you can just look at that. If it's a source distribution, it figures it out by running the setup.py. So that's interesting. So it'll run setup.py to actually figure out what dependencies it has to install, and it'll go do that. And then you might have two dependencies. You might have a thing, and you might depend on, let's say, Beautiful Soup, but you also have some other library that also depends on Beautiful Soup if you follow the dependency tree. And they might even specify versions. So you might wonder, "Well, what happens if one depends on one version, the other depends on the other?" Turns out, it just installs it anyway. Let's take the latest. That's going to be fine, right? That's different than like a requirements file that has different dependency, pinned versions. There's a slight difference there. So, finally, gets it, builds it, installs it. And then it has to figure out where's the path? Am I going to install it to a virtual environment? Am I going to install it into the system, or the user path? Things like that. So you can look at sys.prefix to figure out which one of those are. And there's some environment variables. Ooh! And finally it copies it over in the right place, and your package installed... Oh, before it considers your package installed also it converts the source files into PYC bytecode files. So they don't have to get parsed again. Then your package is installed. Yeah, so anyway.

18:18 OKKEN: Simple.

18:19 KENNEDY: Yeah, so if you're wondering what happens as part of the pip install stuff, there's a lot of details, and I didn't cover all of it. But as much as I thought made sense.

18:27 OKKEN: I was just curious, I was going to try to find one of those complicated packages.

18:32 KENNEDY: Yeah?

18:32 OKKEN: That I knew had to be compiled. Because I went to a couple of mine and they're just Python code, so there's just one perversion, one wheel. But like NumPy, for instance, I know it's got some compiled code in it. It's got, I lost count. It's like 15, 16, 17 different wheels for each version.

18:51 KENNEDY: Yeah, requests has got a ton as well. It's interesting.

18:54 OKKEN: It is interesting how that works. I'm glad it all works so that I don't have to think about it too much.

18:58 KENNEDY: I don't have to think about it either. But it turns out there's a lot of conversation in there about some stuff that's not totally solved, even today, right? About trying to resolve the dependencies in a totally predictable way before you start installing anything and stuff like that. So it's worth checking out.

19:14 OKKEN: It's a hard problem.

19:15 KENNEDY: Yep, for sure. Want to finish up with a cool trick? Like a zoo trick? A zoo animal trick?

19:19 OKKEN: Oh, yeah. I'm just zoning today. So Kevin Markham, he runs, what's the thing he runs?

19:26 KENNEDY: Data School. Dataschool.io.

19:26 OKKEN: Data School. Plus, he's a super nice guy. He's doing something neat that's called daily pandas tricks, or tricks and tips or something like that. But anyway, we'll get a link to it. He's sending out a little tip or trick about pandas every day on Twitter, and the page we're linking to has a whole bunch of them already built in. And I like the notion of just trying to fit something, often they're little screenshots, but they're still pretty small. A little lesson of how to do something cool. I just picked out one, which is let's say you wanted to rename all of the columns in a DataFrame the same way. To replace all the spaces with underscores, or something. And he just shows you how to do that in a little thing. I think that's neat, especially for something, for a package like pandas. There's a whole bunch of stuff you can do with it to have a way to just see a little extra new thing everyday. To say, "That's something that I might use. I'll keep looking at that later." or something. So, I don't think we've talked about it before and I think it's a cool thing he's doing, so wanted to highlight it.

20:31 KENNEDY: Yeah, it's definitely a cool thing he's doing and pandas is one of those things where it's not always obvious. All the little magic that you can do, right? Like if you want to go to the columns and do string operations, just dataframe.columns.str.apply your operation. Right? That's, after you use it for a while it's obvious. But maybe not right away. It definitely isn't to me. Pandas feels a little like magic to me.

20:55 OKKEN: I'm looking at this going, "I would not have guessed that."

20:57 KENNEDY: Exactly. It's not obvious, but once you know it, it's like, "Well, of course, that's better than like..." There's this saying that if you find yourself looping over things in NumPy or pandas, you're probably doing it wrong.

21:08 OKKEN: One of the nice fun things I think is if you get really good at something you'll start learning the things that you shouldn't do, but that are fun. And some of Kevin's tips are, "You can do this, it's sort of fun. "But don't, because it's confusing to other people. But anyway, here's the trick." It's neat that he's including those.

21:29 KENNEDY: It's clever, but too clever sometimes. Cool. Alright, so do you have any extras to share?

21:33 OKKEN: Oh. No, only that we just got finished with our first Python West meetup last night, and it was both exhausting and really fun. So thanks for helping out with that.

21:44 KENNEDY: Yeah, you bet. Good job putting it together. It came out really well. Everyone seemed to have a great time. There was a totally good turnout. I was blown away that it was basically sold out. Not sold out, but booked out, on its very first run, which is crazy. And people out there listening, if you want to come and give a talk at the meetup and you're willing to find your way to Portland, shoot a message to Brian or me and let us know. That'd be cool.

22:09 OKKEN: Yeah. That would be cool. And then, before anybody asks, it was not recorded. So you have to be here. How about you? You've got some news to share.

22:17 KENNEDY: I got all sorts of stuff. A few really quick things. One, I upgraded to macOS Catalina yesterday. And so far, so good. No major problems, all the Python things seem to be working. So if you're, when you're... I did hear that someone out there was having trouble with Miniconda? I don't use Miniconda, so I have no idea about that. Maybe do a Google search if that matters to you. Also, Brian, I switched to working with Adobe Audition. I've been using Audacity and GarageBand. Finally broke down and paid the $30 a month for Adobe Audition, and wow, is it worth it. It's so good. What has been wrong with me to not do that? I just didn't want to learn new software. It's not so much about the money, it's just like I don't want to learn new hotkeys. I already know the hotkeys. But it's so super good. The reason I bring it up on the show instead of after, is if you hear weird artifacts or something odd in the audio, call our attention to it. Because there's all these dials and knobs that can do things like chop off the Ses at the end of words if you turn them too far, and stuff like that. So hopefully things sound better. If they don't, let us know. And then the two Python-related things. Really quick, Azure Databricks also is dropping support for Python 2, so just one more brick to fall for Legacy Python. The Python death clock continues to toll for those who hang on to their Python 2. And the folks over on the VS Code team, Ron Liu, in particular, just announced that at PyCon China they just revealed a cool new Jupyter UI Variable Explorer and telesense stuff for basically running Jupyter notebooks inside VS Code. So if you're a VS Code user and you care about Jupyter, check that out.

23:57 OKKEN: Very cool.

23:57 KENNEDY: Yeah. Absolutely. Absolutely. Well, that's it for the stuff. I got a story for you, a joke, maybe.

23:57 OKKEN: Yes, please.

23:57 KENNEDY: This one comes to us from maybe an unexpected space. A person on Twitter goes by The Sarcastic Pharmacist, sent us this actually really good joke and nice comment. And the theme is that it's hard to distinguish between what is super easy in programming and what is nearly impossible for people who are not doing the programming themselves. So this is actually an xkcd article 1425. It's got a programmer, a woman, sitting there working at her desk. And there's like a manager-type who comes up and is issuing feature requests. Okay? The manager, I'm going to think of one of the people from Office Space, maybe. He comes over and says, "When a user takes a photo with the app, it should check whether they're in a national park." The woman says, "Sure, easy. Easy GIS lookup, give me a few hours." "Oh yeah, and it should also check whether the photo's a bird." And she says, "I'll need a research team and five years." And the subtitle is, "Yes, it can be hard to explain the difference between the easy and the virtually impossible." So there you go. That resonates a lot with me, at least.

23:57 OKKEN: Yeah, we'll probably get a bunch of the image people telling us that it's like five minutes now with all the new image libraries to do a bird.

23:57 KENNEDY: Yeah, but that's now. We probably should see, I should see if there's a date for this, just to be fair. They don't have dates on these. That's kind of funky. Alright, anyway, well, there's probably some algorithm that figures out the number, the xkc, and maps it back to a date, but, yeah.

23:57 OKKEN: Yeah, but that's funny. Cool.

23:57 KENNEDY: Alright, well, great to chat with you as always.

23:57 OKKEN: Thank you.

23:57 KENNEDY: Yep.

23:57 OKKEN: Bye.

23:57 KENNEDY: Bye. Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes. That's Python Bytes as in B-Y-T-E-S. Hey, and get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.

Back to show page