Transcript #10: Dismissing Python's Garbage Collection, PyPI Name Reservations, and Hackers Exfiltrate US Government Data to Save Itself

Recorded on Monday, Jan 23, 2017.

00:00 This is Python Bytes, Python headlines and news delivered directly to your earbuds.

00:05 It's episode 10, recorded Monday, January 23rd, 2017.

00:10 This is Michael Kennedy. I'm here with Brian Okken.

00:12 Hey, Brian. You ready to talk about Python for the week?

00:15 I'm very ready, yes.

00:16 Yeah, we got some really cool stuff, and we have a pretty interesting data story to wrap things up where people are out rescuing data with Python, I'm pretty sure.

00:25 So, let's save that for the end and let's get started with time.

00:29 We had mentioned in episode 7 that Matplotlib 2.0 was coming out, and it is.

00:35 It's official now as of January 17th.

00:38 And there's an article from a website called BlackArbs called "Advanced Time Series Plots in Python." And I thought it was a nice focused tutorial.

00:50 Matplotlib does a whole bunch of stuff, and it's a little bit overwhelming if you only use it once in a while.

00:56 So this is a focused tutorial.

00:58 It's good.

00:59 It uses Pandas, NumPy, Matplotlib, of course, and a library called Seaborn that I had never heard of before.

01:06 It's a statistical data visualization add-on, I think, for Matplotlib.

01:12 Looks good.

01:12 But the article goes on with--

01:14 starts with an empty XY chart and gets some financial data from Yahoo and does some manipulation of it to plot it out.

01:23 But then, now the defaults aren't ever exactly what you want.

01:26 So this tutorial goes through how to add like shaded areas for, in this example it was for times where there were recessions.

01:39 If there's something special about part of your data you want to highlight, there's a way to do that.

01:44 Adding chart titles and axis labels and styling the legend.

01:48 I never knew you could do this: different formatting for the X and Y tick labels. This chart had both positive and negative numbers, and if you want to draw a zero line, or a horizontal line anywhere, it shows how to do that. Turning on data points, too. And a couple of things I didn't even know you could do were to put in chart annotations to describe specific plot points and then put arrows pointing to that piece of the data.

02:20 And then adding logos and watermarks over the top.

02:23 And it's a nice, short, concise tutorial.

02:26 I like it.

02:27 - Yeah, it's really nice.

02:28 And it's super approachable.

02:30 It just shows you, like, here's the code steps from here to there.

02:33 Get some financial data from Yahoo and do some graphs.

02:36 And yeah, if you need to graph stuff, right, Python is getting better and better.

02:40 So very, very nice.

02:42 Speaking of pictures, Instagram has a lot of pictures.

02:45 Yeah, yeah, and they actually run a significant amount of traffic through Django and through Python.

02:54 They use a pretty interesting way of configuring their servers.

02:59 It's not at all unlike the way I do it for my Talk Python websites and, in fact, Python Bytes as well.

03:05 And so let me just quote from their little article here, where it says how they run their web server.

03:10 Instagram's web servers run Django in a multi-process mode with a master process that forks itself dozens of times to create worker processes that actually handle the requests.

03:19 And they're doing that with uWSGI in pre-fork mode so they can leverage shared memory across the master and the 20, 50, whatever worker processes that it controls, because they're all kind of running the same code for the most part.

03:33 So why do you need, you know, 20 copies of Python loaded into memory and things like that.

03:39 Right.

03:40 So they're thinking about how can they get their code to run faster?

03:42 How can they run more of these forked processes and basically add more concurrency per server?

03:50 And so they came up with a pretty radical idea.

03:52 They said, "What if we just turn off garbage collection in Python?" That seems bad, right?

03:58 It does seem bad, yeah.

04:00 So they wrote an article called "Dismissing Python Garbage Collection" at Instagram.

04:05 What they did is they had a thing where they would monitor the memory of the worker processes and, I believe, restart them if the memory got too out of control.

04:12 I don't know why they're doing that, but for whatever reason they did.

04:15 And I think that they found that the shared memory was much, much less shared than they expected.

04:22 So, they said that each worker process was like 225 megs of memory usage, and they're only sharing like 140 megs, even though they're basically all doing the same thing.

04:31 So they went to look and sort of do some profiling, and they thought that there was some Linux copy-on-write behavior happening due to reference counting. So they said, all right, let's turn off reference counting. That seems really bad. Because for those of you who don't know, Python's GC, or sort of memory management, works in two modes. There's a direct, deterministic reference counting thing where, when the last reference to an object goes away, it's typically deleted instantly. But the problem with reference counting for garbage collection and memory management is you can get cycles.

05:03 So there's a GC that manages cycles.

05:06 - Okay.
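The two mechanisms described here can be seen in a small sketch. It assumes CPython (other interpreters manage memory differently), and the `Node` class and `demo` function are made up for illustration; they do not come from the Instagram article.

```python
import gc
import weakref


class Node:
    """Tiny object that can point at another Node."""

    def __init__(self):
        self.other = None


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # switch off the cyclic collector; reference counting still runs
    try:
        # Case 1: no cycle. Dropping the last reference frees the object at once.
        a = Node()
        a_ref = weakref.ref(a)
        del a
        freed_by_refcount = a_ref() is None

        # Case 2: a two-object cycle. Each object keeps the other's count above zero.
        b, c = Node(), Node()
        b.other, c.other = c, b
        b_ref = weakref.ref(b)
        del b, c
        cycle_leaked = b_ref() is not None

        # A manual pass of the cycle collector reclaims the pair.
        gc.collect()
        freed_by_gc = b_ref() is None
    finally:
        if was_enabled:
            gc.enable()
    return freed_by_refcount, cycle_leaked, freed_by_gc


print(demo())
```

With the collector off, only the cycle lingers, which is the kind of leak the hosts say Instagram chose to accept.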

05:07 - And so what happened is they tried to turn off reference counting, that didn't help.

05:10 They said, all right, well, let's turn off GC.

05:12 And they said, when they turned off garbage collection, they were still able to run about the same, only like accepting sort of memory leaks from cycles, I guess, and they said they successfully raised the shared memory from 140 megs to 225.

05:25 I guess that was, I was quoting a little bit the wrong number, but they dramatically increased the shared memory, and they were able to drop the memory usage per server by eight gigs.

05:36 And they saved 25% RAM on their entire Django fleet.

05:39 In fact, they actually got it to run 10% faster due to the way the rearranging of memory done by garbage collection actually lines up with CPU caches and things.

05:50 So they wrote this really detailed article with like lots of stats and the command you can run to do these things.

05:56 And they said, actually, for us, for our weird, ultra high performance, ultra heavy load servers, it makes sense to turn off Python's garbage collection and basically set up a custom shutdown thing that doesn't collect too much memory on the way out the door.

06:12 Isn't that interesting?

06:14 - It's very interesting.

06:15 I didn't even know you could turn off the garbage collection.

06:18 - There's apparently a way that you can do it, but it's not simple, yeah.

06:23 And they ran into problems, like they turned it off and other libraries would ensure that it was on and turn it back on and things like that.

06:29 They're like, no, I want it off.

06:30 And so they had, I think they had to like recompile some things, you know, it's not like a flag, but anyway, it's pretty interesting.
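For the "other libraries turn it back on" problem, one standard-library approach is to zero the collection threshold instead of calling `gc.disable()`. This is a minimal sketch; whether it matches exactly what Instagram deployed is an assumption, and the helper name `switch_off_automatic_gc` is hypothetical.

```python
import gc


def switch_off_automatic_gc():
    """Zero the generation-0 threshold and return the old thresholds."""
    previous = gc.get_threshold()
    gc.set_threshold(0)  # a threshold of 0 means automatic collection never triggers
    return previous


previous = switch_off_automatic_gc()
was_enabled = gc.isenabled()
gc.enable()  # stand-in for a third-party library re-enabling the collector
survived = gc.get_threshold()[0] == 0  # enable() does not touch the thresholds
print("automatic collection still off:", survived)

# Put the interpreter back the way it was.
gc.set_threshold(*previous)
if not was_enabled:
    gc.disable()
```

The design point is that `gc.enable()` only flips the on/off switch, so a zero threshold keeps automatic collection from running even after a library calls it.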

06:38 - Interesting story.

06:39 - Yeah, it's a thing that questions some assumptions about how things work, and sometimes it definitely does not sound pretty, but you know, maybe you're optimizing that last 10, 20%.

06:54 It's not gonna be pretty, but this is a pretty interesting look at what Instagram's doing.

06:58 I just like to look at these large-scale deployments and people handling tons of requests, and what they're doing. It's always interesting.

07:05 - Yeah, and also somebody else out there might be thinking, hey, maybe we could turn it off and try to learn from Instagram.

07:12 - Yeah, absolutely, they detailed it out there.

07:14 They have a nice engineering blog.

07:15 They put interesting stuff up there frequently.

07:17 - But Brett Cannon, no, I forget where Brett Cannon works.

07:21 Brett Cannon works with Dino Viehland over at Microsoft.

07:25 I think on the Azure Data Science team, and Brett Cannon is a Python core developer.

07:32 - Yeah, we've learned a lot of interesting things from Brett, but he's got a blog called snarky.ca and he put up "My experience with type hints and mypy."

07:44 I was attracted to this article because I've sort of been watching this type hint thing with a little skepticism.

07:52 So I'm not really sure if it's going to help, how it can help me.

07:56 So I was interested in reading this.

07:58 He took a project that he supports, and then used both mypy and type hints, hoping to improve the quality of it.

08:12 And it's kind of an interesting project.

08:15 I didn't even know what CLA was.

08:16 So it's a CLA enforcement bot for the Python Software Foundation.

08:21 And that's the contributor license agreement.

08:24 So the idea is if somebody pushes or submits a pull request into one of the Python projects, this robot goes out and makes sure that the person that is requesting the pull request has a signed CLA.

08:41 Sounds like it's obvious why this is needed.

08:43 But it's a small enough project to run this experiment of adding type hints on. And it's not a very long article, but he points to his pull request for this, the changes he made, and the lessons learned. I didn't know what mypy was before, so mypy is a static analysis tool that, I think, just makes sure your type hints are correct, or uses the type hints for linting. Yeah, I think so. And you know the PEP that you talked about originally, I think that was all Python 3 stuff, but I'm pretty sure that all works on Python 2.

09:23 Now, at least with some of the tooling, the type hint ideas are being backported, with the idea being that we can take a lot of this legacy Python code, give it some static type definitions, and then use tooling more intelligently, with more information, to check the upgrade to Python 3.

09:42 So actually it can provide some structure to the older code.

09:47 So it's, you know, I always thought of type hints being this thing that only Python 3 gets and it's kind of this new deal, but you know, it may turn out to be most important in crossing that Python 2 to Python 3 chasm, which is pretty interesting.

10:01 It'd be an interesting benefit.

10:02 You know what else is interesting?

10:04 The second largest, the second most active contributor to this mypy project, Guido van Rossum.

10:11 Yeah, he's really doing a lot with this.

10:15 There was a couple hiccups that he ran into.

10:18 Apparently some of 3.6 isn't fully supported, but I'm sure that'll get supported fairly quickly.

10:26 What I really wanted to highlight was, I'm going to quote from his conclusion.

10:31 After this experiment, he asked himself, would he bother using typing and static typing in new Python 3 code?

10:38 And essentially he said yes, once it supports f-strings. Apparently at the time it didn't; I don't know if it's fixed yet.

10:46 Yeah, I think it only supported Python 3.5, and f-strings are, of course, 3.6, which is shiny new.

10:53 But he says, "When I design an API, I already have to think about what type of objects would be acceptable.

11:00 So quickly writing down my assumptions doesn't hurt anything.

11:03 It's relatively quick and it benefits anyone having to work with my code.

11:08 I also wouldn't contort my code to fit within the confines of type hints. For example, if type hints force me to write cleaner code, then that's great, but if something is so dynamic that it can't have type hints, then that's fine, and I'll happily use typing.Any as an escape hatch. In the end, I view type hints as enhanced documentation that has tooling to help verify that the documentation about types is accurate. And for that use case, I see type hints worth doing and not at all a burden." And he also mentions that it didn't make his code less readable, it only enhanced the readability. And since I'm all for making your code more readable, this article actually gives me a more positive light on using types.

11:55 Yeah, that's cool. I do periodically put type hints in my Python 3 code. And I find that I usually do it when I'm looking at a function and I'm like, you know, I don't really remember the context in which this is called and what is being passed in here. So let me make it really explicit at the function signature level. And then the tooling can of course tell me if we're doing it wrong somewhere else. But it's kind of a reminder to me if it's not obvious what's going on. I don't do it universally, though. It's more in those places where knowing the types is going to really help.
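As a rough illustration of "type hints as documentation," here is a minimal sketch. The function names and the CLA-style data are hypothetical and are not taken from Brett Cannon's bot; the only real API used is the standard `typing` module.

```python
from typing import Any, Dict, List


def unsigned_contributors(usernames: List[str], signed: Dict[str, bool]) -> List[str]:
    """Return the usernames that have no signed agreement on record."""
    return [name for name in usernames if not signed.get(name, False)]


def describe(payload: Any) -> str:
    """Accept anything: Any tells a type checker not to constrain this argument."""
    return "{}: {!r}".format(type(payload).__name__, payload)


print(unsigned_contributors(["ann", "bob"], {"ann": True}))
print(describe({"action": "opened"}))
```

The annotations do nothing at runtime. A checker such as mypy reads them, so a call like `unsigned_contributors(42, {})` should be reported before the code ever runs, while `describe` stays deliberately open-ended.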

12:27 Yeah, and I'm still spending, I guess, too much time straddling the legacy Python, Python 3 world that I'm still having trouble with that chasm.

12:37 Yeah, sure.

12:38 Well, mypy works on Python 2, so there you go.

12:42 You know what else is quite pervasive throughout Python?

12:45 Underscores.

12:46 The underscore is special in Python.

12:48 Is it special?

12:49 It is special.

12:50 You know, there's a lot of things that are special or, you know, maybe a more common way to say it would be idiomatic, right, or Pythonic. I think the underscore is one of these symbols that shows up a lot, and its uses actually have meaning, but people coming from other languages don't necessarily know all the nuanced behavior of the underscore, or, not behavior, meaning of the underscore.

13:14 And so there's this article called "Understanding the underscore of Python," which is kind of a cool title. It gives you some examples. It says if you come across some Python code, you might see something like for underscore in range(10), or dunder init with self, things like that. You see it all the time. But these underscores have meanings, and they said, look, there's basically five cases, with some sub-cases thrown in, where the underscore has special meaning. The first one is if you're in the REPL. If you just type python and go into the REPL, the last value of whatever you typed is always underscore. Right, so if you call a function and you forget to store its value, you can just say underscore, and that's whatever that function returned.

14:00 For example, I didn't know that before reading this article.

14:03 That's pretty cool.

14:04 Yeah, it's really handy.

14:06 If you did something that took like 20 seconds and you forgot to put that in a variable, you can still have it.

14:10 It's called underscore.

14:12 That's cool.

14:13 But that's not really in code, right?

14:14 The first example I gave you for underscore in range of 10.

14:18 That's kind of the I don't care.

14:20 I have to put something here, but I want to explicitly make the point that I don't care about this variable.

14:26 So there's there's interesting places where that might come up.

14:30 Right?

14:31 This for underscore is an example: maybe I just want to do a loop 10 times. I don't actually care about the number, I want to do it 10 times.

14:37 So that's one way to do it. So that's just a convention, right?

14:43 So what we're, what's happening there is the underscore is just a character, which is a valid variable name.

14:49 - Yes, exactly.

14:50 Except for the linters, the linting tools know about it.

14:54 - Oh, okay.

14:55 - So for example, in PyCharm, if you have a function that takes a parameter and you're not using that parameter, then it'll make it gray and give it a little squiggly saying here's an unused argument.

15:07 An example that comes to mind is in the dunder init in Pyramid: there's a startup thing that passes a configuration settings thing.

15:15 And if you don't care about the configuration settings thing, well, what are you going to do?

15:20 Well, you can just put an underscore and then you don't get warnings that that variable is unused.

15:26 Now, let's see, if you're doing tuple unpacking and you've got a tuple of five things, you want the first and the fourth or whatever, you can put some underscores in there to make it unpack correctly all over the place so that you have this I don't care variable and the linters actually will not warn you that underscore is not used.

15:42 Okay, cool.

15:43 That's cool.
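The "I don't care" uses just described look like this in practice. This is a minimal sketch with made-up data; `repeat` and `record` are hypothetical names.

```python
def repeat(action, times):
    """Call action() a fixed number of times; the loop counter is never used."""
    results = []
    for _ in range(times):  # "_" signals the value is intentionally ignored
        results.append(action())
    return results


# Tuple unpacking: keep the first and fourth items, discard the rest.
record = ("Ada", "Lovelace", 1815, "London", "mathematician")
first, _, _, city, _ = record

print(repeat(lambda: "hi", 3))
print(first, city)
```

The underscore is an ordinary variable name here, so nothing stops code from reading it later; the convention only tells readers and linters that nobody intends to.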

15:44 Then you have, like, giving meaning to variables.

15:47 Underscore variable is like a, some people say private.

15:51 I think of it more as a protected variable.

15:53 So if I've got a class that has a field or attribute, like, underscore name or whatever, right, it hints to consumers of that class they should leave it alone, but technically it's still accessible.

16:03 I think of that more as protected, right?

16:05 It still leaves it accessible to derived classes, but it does tell people stay away.

16:09 Yeah, I also think of that as, like, I don't make any promises that I might not change this behavior. Yes, exactly. All right. And you can have a double underscore in a class, which will actually rename it so it is much harder to get a hold of, like name mangling, so that's like real private. And if you have a conflicting thing, like if you have a variable you want to call in, well, there's a keyword called in, like key in dictionary, so you can just say in underscore. SQLAlchemy has like and underscore, or underscore, in underscore, so it can have things that look very similar to reserved keywords but, you know, avoid them. And then you have the whole Python data model with like dunder init, dunder new, dunder repr, dunder str, and so on, like all the magic methods. And then there's a few that I didn't really know about. Well, one, I guess, is in the internationalization of strings; there's some special meaning there. It's kind of complex, so I'm not going to go into it. And then in Python 3.6 you can use it where you would have commas for digit grouping, so you can have underscores: one underscore zero zero zero underscore zero zero zero is a million in Python 3.6. Yeah, and this is good also just for reading other people's code, to understand how they're using it. Yeah, because you might just think, oh, that's a weird naming convention that person used here, but they're trying to communicate something very specific by using underscore in all these circumstances. So I think it's good to bring it up for people who maybe come from other languages.
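The naming conventions covered in this stretch can be sketched in a few lines. The `Account` class and `contains` function are hypothetical, and the digit-grouping literal needs Python 3.6 or later.

```python
class Account:
    def __init__(self, owner, balance):
        self.owner = owner
        self._balance = balance  # single underscore: internal by convention only
        self.__pin = "0000"      # double underscore: stored as _Account__pin

    def check_pin(self, pin):
        return pin == self.__pin  # inside the class the short name still works


def contains(in_, item):
    """Trailing underscore avoids clashing with the keyword 'in'."""
    return item in in_


acct = Account("ada", 100)
million = 1_000_000  # underscores as digit separators (Python 3.6+)

print(acct._balance)               # accessible, just discouraged
print(hasattr(acct, "__pin"))      # the unmangled name does not exist
print(acct._Account__pin)          # the mangled name does
print(million == 1000000)
```

Name mangling makes accidental access and subclass clashes less likely; it is not a security boundary, since the mangled name remains reachable.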

17:33 Yeah, especially people that come from Perl because Perl has a very special meaning for underscore.

17:38 So it's good to know that it isn't that.

17:40 Yeah, you know one thing that I think is frustrating is if there's like a super old PyPI package.

17:46 It hasn't been updated in five years, but it's got like that perfect name.

17:49 There should be some active project that could sort of inherit its name.

17:53 There was a PEP that came out on the 12th by Donald Stufft.

17:57 And it's in draft status currently, but I'm sure that, with probably a few tweaks by the Python core people, it'll probably go through. It's PEP 541, "Package Index Name Retention," and it's basically just like what you said: given that the package names in the index are a flat namespace, unique names are a finite resource. So as the package index is growing and growing older, we need to have a way to deal with things like old packages that aren't updated anymore, somebody that just squatted on a name but hasn't done anything with it, or a conflict between two people that legitimately want to use it.

18:45 How do we resolve those conflicts?

18:48 Yeah, an example that comes to mind for me is there's a cool package for kind of a crummy problem, called suds. So consuming SOAP web services in Python is not pretty because they have all their namespaces. SOAP is like a big, gnarly XML format for web services that preceded, or was sometimes more popular within enterprises than, JSON HTTP services. So there's a cool package called suds that has a really great API for consuming those on the client side. But it's kind of old and outdated; it only supports Python 2. So a guy whose username is Jurko (no reflection on him personally) created a project called suds-jurko. So the package name in code is just suds, but it's basically that thing upgraded for Python 3.

19:42 And he couldn't use suds because, even though it was not being maintained, the original project still owned the name on PyPI.

19:50 Yeah, that's a tough one though.

19:51 And even if something's not updated, it's hard to know if people are using it or not.

19:56 Some of the stuff that was addressed in this proposal is things like checking out other uses of the word like on CPAN or NPM or GitHub.

20:10 So if it's a very popular GitHub repository, maybe we should let it go through or something.

20:17 I actually wanted to put in a package name to expect in.

20:23 It was already used by somebody else, but then I went out and looked and it was legitimate.

20:27 There's a lot of other tools that were more like that tool than what I wanted to use.

20:33 So it would have been bad for me to try to throw a hissy fit and grab that.

20:38 Yeah, it's interesting. It's cool that they're proposing a way to look at prior art and existing things.

20:43 I'm actually a little surprised it didn't already exist, but I'm glad it's happening.

20:47 Yep, it's definitely a good thing. I mean, we're about to come up on 100,000 PyPI packages.

20:52 Like, the naming is starting to become something scarce, right?

20:56 All right, so did you ever read William Gibson books like Neuromancer and some of these crazy futuristic cyberpunk type books?

21:05 No.

21:06 I kind of feel like we're living in this sort of sci-fi future in a weird way.

21:13 There was the craziest article that I've seen in a while come out and I'll just read you the title.

21:19 Hackers downloaded US government climate data and stored it on European servers as Trump was being inaugurated.

21:27 So here's the deal.

21:28 There's, you know, I'm going to try to make this as politically neutral as possible.

21:33 I just, I just want to talk about the data side of this in a pretty interesting way.

21:38 Like who owns and controls data, right?

21:41 Does the president, does the country, does the world own this data?

21:46 Like, the US has an incredible amount of data about the world at the Environmental Protection Agency, at NASA, the Department of Energy, and so on. And there had been some rumors that when Trump took office, he was going to reduce, cut back, or even fire some people that worked on climate change. People knew this a few months before, maybe a month before, and there were people looking around saying, you know, these websites, like the EPA, for example, have a bunch of data on them.

22:18 And they're like, what if that gets deleted?

22:21 There was actually a meeting of 60 programmers and scientists at the Department of Information Studies, UCLA, and their goal was to find as many of these websites with government data on it, and start to scrape it and download it and exfiltrate it from the country.

22:41 Isn't that crazy?

22:43 So what they did is they all got together, and they were working like 22 hours a day for like the four or five days leading up to the inauguration.

22:52 And they were just downloading, they had like a huge spreadsheet of hundreds of sites where they'd go in and they'd try to get the data and they would pull it off.

22:59 So examples include web pages dedicated to Department of Energy solar project initiatives, or the Energy Information Administration stuff about fossil fuels compared to renewable energy, and fuel cell research and those kinds of things.

23:14 And it turns out there were actually these, what they called, volunteer data rescue events in Toronto and Philly and Chicago and Indianapolis and Michigan over the last weeks, trying to scrape hundreds of thousands of pages off the internet. So you might say, people are being crazy, come on. It turned out that exactly at noon, as Trump was sworn in and that UCLA group was actively working, all the climate change related data on whitehouse.gov disappeared, but they had gotten a lot of it. Okay. And so there's this company called PageFreezer, and I think they're a Canadian company, and they basically create snapshots of web pages, a little bit like the Internet Archive or the Wayback Machine, and it's actually a commercial service.

23:57 So these guys start to participate and they're storing all of this data somewhere in Europe.

24:03 - Is the reason for storing it in Europe that there'd be no way for the government to force it to shut down or something?

24:12 - Yeah, exactly, exactly.

24:15 So I just, I thought that was like, that's the craziest headline around programming I've seen this week. - Yeah.

24:21 - And I wanted to include that.

24:22 - Definitely.

24:23 Now, I do know that there's, so again, to try to not be too political on this, but aren't there legacy versions of the old versions of the White House website?

24:35 - There are, I mean, probably on the Wayback Machine and stuff, but it could be that it doesn't necessarily store the data linked in it, you know what I mean?

24:44 like the 100 meg CSV file or whatever.

24:48 And that was the kind of stuff they were actually going after.

24:51 - Well, it's good to have it saved.

24:53 - It's good to have it saved, so we'll see where it goes.

24:58 But I just thought this was a really interesting story, and I'm sure Python played a pretty key role in a lot of screen scraping initiatives with Beautiful Soup and Scrapy and things like that.

25:08 - Yeah.
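The point about archives missing the linked data files suggests what a rescue script has to do: find the links to the files themselves. The sketch below uses only the standard library rather than Beautiful Soup or Scrapy. The tooling the volunteers actually used is not known from this episode, and the class name, file suffixes, and example URL are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class DataLinkFinder(HTMLParser):
    """Collect absolute URLs of links that look like downloadable data files."""

    DATA_SUFFIXES = (".csv", ".zip", ".json", ".xls", ".xlsx")  # assumed list

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(self.DATA_SUFFIXES):
            self.links.append(urljoin(self.base_url, href))


def find_data_links(html, base_url):
    finder = DataLinkFinder(base_url)
    finder.feed(html)
    return finder.links


sample = '<a href="/data/temps.csv">Temps</a> <a href="about.html">About</a>'
print(find_data_links(sample, "https://example.gov/climate/"))
```

A real run would fetch each page first and then download every URL returned, which is where the large CSV files mentioned above would come from.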

25:08 - All right, so I think we're gonna leave it there for this week.

25:11 That's a lot of cool news for everyone, Brian.

25:13 Definitely cool.

25:14 All right, well thanks for talking to me this week.

25:16 You bet.

25:17 Thanks for sharing the stories and I'll catch you next time.

25:20 Thank you for listening to Python Bytes.

25:22 Follow the show on Twitter via @PythonBytes, that's Python Bytes as in B-Y-T-E-S.

25:28 And get the full show notes at PythonBytes.fm.

25:31 If you have a news item you want featured, just visit PythonBytes.fm and send it our way.

25:36 We're always on the lookout for sharing something cool.

25:38 On behalf of myself and Brian Okken, this is Michael Kennedy.

25:42 Thank you for listening and sharing this podcast with your friends and colleagues.
