« Return to show page
Transcript for Episode #10:
Dismissing Python's Garbage Collection, PyPI Name Reservations, and Hackers Exfiltrate US Government Data to Save Itself
00:00 KENNEDY: This is Python Bytes, Python headlines and news delivered directly to your earbuds. It’s Episode #10, recorded Monday, January 23, 2017. This is Michael Kennedy. I’m here with Brian Okken.
00:00 Brian. Are you ready to talk about Python for the week?
00:00 OKKEN: I’m very ready, yes.
00:00 We’ve got some really cool stuff and we have a pretty interesting data story to wrap things where people are out rescuing data with Python, I’m pretty sure.
00:00 let’s save that for the end and let’s get started with time.
00:00 We had mentioned in Episode #7 that matplotlib 2.0 was coming out and it is. It’s official now as of January 17th, and there’s an article from a website called blackarbs.com called, “Advanced Time Series Plots in Python.” I thought it was a nice focused tutorial. Matplotlib does a whole bunch of stuff. It’s a little bit overwhelming if you only use it once in a while. This is a focused tutorial. It uses Pandas, NumPy, matplotlib, of course, and a library called Seaborn, that I had never heard of before. It’s a statistical data visualization add-on, I think, for matplotlib. It looks good.
00:00 article starts with an empty XY chart and gets some financial data from Yahoo and does some manipulation of it to plot it out. But then the defaults aren’t ever exactly what you want. This tutorial goes through how to add shaded areas. In this example, it was for times where there were recessions. If there’s something special about part of your data that you want to highlight, there’s a way to do that, adding chart titles and access labels and styling the legend. I never knew you could do this, the different formatting for the XY tick labels. This chart had both positive and negative numbers and if you want to draw a zero line or a line anywhere, a horizontal line, it shows you how to do that, turning on data points. And one of the things I didn’t know you could do, a couple things, were to put chart annotations to describe specific plot points and arrows pointing to that piece of the data, and then adding logos and watermarks over the top. It’s a nice, short, concise tutorial. I like it.
00:00 Yeah, it’s really nice. It’s super approachable and just shows you, ‘Here’s the code steps from here to there.’ Get some financial data from Yahoo and do some graphs. If you need to graph stuff, Python is getting better and better, so very nice.
00:00 of pictures, Instagram has a lot of pictures. And they actually run a significant amount of traffic through Django and through Python. They use a pretty interesting way of configuring their servers. It’s not at all unlike the way I do it for my Talk Python websites as well, in fact Python Bytes as well. Let me just quote from their little article here. It says:
00:00 We Run Our Web Server: Instagram’s web server runs Django and a multi-process mode with a master process that forks itself to create worker processes that take incoming user requests. For the application server, we use uWSGI with pre-fork mode to leverage memory sharing between master and worker processes.”
00:00 doing that with micro-WSGI in a pre-fork mode so they can leverage shared memory across the master and the 20/50 whatever worker processes that it controls. They’re all kind of running the same code for the most part, so why do you need 20 copies of Python loaded into memory and things like that. So, they’re thinking how can they get their code to run faster, how can get they more of these forked processes and basically add more concurrency per server. They came up with a pretty radical idea. They said, “What if we just turn off garbage collection in Python?” That seems bad, right?
00:00 It does seem bad, yeah.
00:00 (Laughing) So, they wrote an article called, “Dismissing Python Garbage Collection at Instagram.” What they did is, they had a thing where they would monitor the memory of the worker processes and, I believe, restart them if the memory got too out of control. I don’t know why they were doing that, but for whatever reason they did, I think that they found that the shared memory was much, much less shared than they expected. So, they said that each worker process was like 225 MB of memory usage and they’re only sharing 140 MB, even though they’re basically all doing the same thing.
00:00 went to look, and do some profiling and they thought that there was some Linux copy on right behaviors happening due to reference counting. They said, ‘Alright, let’s turn off reference counting.’ That seems really bad. For those of you guys who don’t know, Pythons GC or memory management works in 2 modes. There’s a direct reference counting deterministic thing, that when the last reference to an object goes away it’s typically deleted instantly. But, the problem with reference counting for garbage collection and memory management is you can get cycles. So, there’s a GC that manages cycles and so what happened is, they tried turning off reference counting, that didn’t help. They said, ‘Alright, let’s turn off GC.’ When they turned off garbage collection, they were still able to run about the same, only accepting memory leaks from cycles. They said they successfully raised the shared memory 140 MB to 225 MB. They dramatically increased the shared memory and they were able to drop the memory usage per server by 8 GB. And they saved 25% RAM on their entire Django fleet. In fact, they actually got it to run 10% faster, due to the way the rearranged memory by garbage collection actually lines up with CPU caches and things.
00:00 wrote this really detailed article with lots of stats and the commands you can run to do these things. They said, ‘Actually, for us, for our weird, high performance heavy load servers, it makes sense to turn off Python’s garbage collection.’ Basically, set up a custom shut-down thing that doesn’t collect too much memory on the way out the door. Isn’t that interesting?
00:00 It’s very interesting. I didn’t even know you could turn off the garbage collection.
00:00 There’s apparently a way that you can do it but it’s not simple. And they ran into problems. They turned it off and other libraries would ensure that it was on and turn it back on and things like that. I think they had to re-compile some things. It’s not like a flag.
00:00 It’s definitely an interesting story.
00:00 It questions some assumptions about how things work and that sometimes it definitely does not sound pretty. But, you know, maybe you’re optimizing that last 10, 20%. It’s not going to be pretty but this is a pretty interesting look at what Instagram is doing. I just like to look at these large-scale deployments and people handling tons of requests. It’s always interesting.
00:00 Yeah. Also, somebody out there might be thinking, ‘Hey, maybe we could turn it off and try to learn from Instagram.’
00:00 Absolutely, they detailed it out there. They have it nice engineering blog. They put interesting stuff up there frequently.
00:00 Brett Cannon, I forget where Brett Cannon works.
00:00 Brett Cannon works with Dina Veeland over at Microsoft. I think the Azure Data Sites Team and their Brett Cannon is a Python Core Developer.
00:00 Yeah, we’ve learned a lot of interesting things from Brett. He’s got a blog called snarky.ca and he put up “My Experience with Type Hints and Mypy.” I was attracted to this article because I’ve sort of been watching this type hint thing with a little skepticism so I’m not really sure how it can help me. So, I was interested in reading this. He took a project that he supports and used both mypy and type hints to try to improve the quality of it. It’s an interesting project. I didn’t even know what CLA was. It’s a CLA enforcement bot for the Python Foundation. And that’s the Contributor License Agreement. The idea is if somebody submits a pull request into one of the Python projects, this robot goes out and makes sure that the person that is requesting the pull request has a signed CLA. It’s obvious why this is needed. It’s a small enough project to run this experiment on adding type hints. It’s not a very long article but he points to his pull request for this, the changes he made, the lessons learned. I didn’t know what mypy was before. Mypy is a static analysis too that makes sure your type hints are correct, or uses the type hints.
00:00 Yeah, I think so. The PEP that you talked about, originally, I think that was all Python 3.x stuff. I’m pretty sure that all works on Python 2.x, at least with some of the tooling. The type hint ideas are being back-ported, with the idea being that if we can take a lot of this Legacy Python code, give it some static type definitions and then use tooling more intelligently with more information to check the upgrade to Python 3.x. So, actually, it can provide some structure to the older code.
00:00 always thought of type hints being this thing that only Python 3.x gets and it’s kind of this new deal but it may turn out to be most important in crossing that Python 2.x to Python 3.x chasm, which is pretty interesting.
00:00 It may be an interesting benefit.
00:00 You know what else is interesting? The second largest, second most active contributor to this mypy project? Guido Von Rossum.
00:00 Oh, really?
00:00 Yeah, he’s really doing a lot with this.
00:00 There was a couple hiccups that he ran into apparently. Some of the 3.6 isn’t fully supported, but I’m sure that will get supported fairly quickly.
00:00 I really wanted to highlight was, I’m going to quote from his conclusion. After this experiment, he asked himself, would he bother using typing in new Python 3.x code, and essentially, he said yes. Once it supports F strings. At the time, it doesn’t; I don’t know if it’s fixed yet.
00:00 I think it only supported Python 3.5 and F strings are, of course, 3.6.
00:00 He says:
00:00 I design an API, I already have to think about what type of objects would be acceptable so quickly writing down my assumptions doesn’t hurt anything. It’s relatively quick and it benefits anyone having to work with my code. But I also wouldn’t contort my code to fit within the confines of type hints. For example, if type hints forced me to write cleaner code then that’s great. If something is so dynamic that it can’t have type hints, then that’s fine and I’ll happily use typing.any as an escape hatch. In the end, I view type hints as an enhanced documentation that has tooling to verify that the documentation about types is accurate. And for that use case I see type hints worth doing and not at all a burden.”
00:00 also said it didn’t make his code less readable, it only enhanced the readability. Since I’m all for making your code more readable, this article gives me a more positive light on using types.
00:00 That’s cool. I do, periodically, put type hints in my Python 3.x code and I find that I usually do it when I’m looking at a function. I don’t really remember the context of what this is called and what it’s being passed in here, so let me make it really explicit at the function signature level, and then the tooling can of course tell me if we’re doing it wrong somewhere else. It’s kind of a reminder to me; it’s not obvious what’s going on but I don’t do it universally. It’s more like in those places where I know the types really going to help.
00:00 I’m still spending too much time straddling the Legacy Python/Python 3.x world and still having trouble with that chasm.
00:00 Yeah, sure. Well, the mypy works on Python 2.x, so there you go.
00:00 know what else is quite pervasive throughout Python? Underscores. The underscore is special in Python.
00:00 Is it special?
00:00 It is special. You know, there’s a lot of things that are special or maybe a more common way to say it would be idiomatic, or Pythonic. I think the use of the underscore is one of these symbols that shows up a lot. And they actually have meaning, but people coming from other languages don’t necessarily know all the nuanced meaning of the underscore.
00:00 there’s this article called, “Understanding the Underscore (_) of Python.” Which is a cool title. It gives you some examples. It says, if you come across some Python code you might see something like for _in range(10) or _init_(self). Things like that; you see it all the time. These underscores have meaning. They said, ‘Look there’s basically 5 cases with some subcases thrown in about when you might have special meaning with this underscore.’ The first one is, if you’re in the REPL, if you just type python and go into the REPL, whatever you type, the last value is always underscore. If you call a function and you forget to store its value you can just say underscore (_) and that’s whatever that function returned.
00:00 I didn’t know that before reading this article; that’s pretty cool.
00:00 Yeah, it’s really handy. If you did something that took 20 seconds and thought, ‘Ah, I forgot to put that in a variable’ you still have it. It’s called underscore.
00:00 that’s not really in code, right. The first example I gave you, 4_in range(10), that’s kind of the ‘I don’t care, I have to put something here. I want to explicitly make the point that I don’t care about this variable.’ So, there’s interesting places where that might come up. There’s 4_ (underscores) in an example. I just want to do a loop 10 times, I don’t actually care about the number. I just want to do it 10 times. That’s one way to do it.
00:00 That’s just a convention, right? So, what’s happening there is the underscore is just a character which is a valid variable name?
00:00 Yes, exactly. Except the linting tools know about it. For example, in PyCharm, if you have a function that takes a parameter and you’re not using that parameter then it will make it gray and give it a little squiggly saying, ‘Here’s an unused argument.’ Examples come to mind in the dunder_init_ in Pyramid there’s a startup thing that passes a configuration settings thing. And if you don’t care about the configurations settings thing, well, what are you going to do? You can just put underscore (_) and then you don’t get warnings that that variable is unused.
00:00 you’re doing tuple unpacking and you’ve got a tuple of 5 things and you want the 1st and the 4th or whatever, you can put some underscores in there to make it unpack correctly. It’s all over the place. You have this ‘I don’t care’ variable and the linters actually will not warn you that underscore is not used.
00:00 then you have giving meaning to variables, _variable_ is like a protected variable. So, if I’ve got to class and it has a field or attribute like _name or whatever, it hints to consumers of that class that they should leave it alone but technically, it’s still accessible. I think of that more as protected. It still leaves it accessible to derived classes but tells people ‘stay away.’
00:00 I also think of that as, I don’t make any promises that I might not change this behavior.
00:00 Yes, exactly. And you can have double underscore in a class which will actually rename it, which is much harder to get a hold of, like name mangling. That’s real private. If you have a conflicting thing, like if you have a variable and you want to call it ‘in,’ there’s a keyword called ‘in’ like ‘key’ in dictionary. You can just say in_ SQLAlchemy has and_or_in_, so it can have things that look very similar to restricted key words, it avoids them. I then you have whole Python data model with dunder init, dunder new or representation strings and so on. All the magic methods.
00:00 there’s a few that I didn’t really know about. In the internationalization of strings, there’s some special meaning there. It’s kind of complex; I’m not going to go into it. And then in Python 3.6, you can use it as like where you would have commas for digit grouping, you can have underscore. So, like 1_000_000 is one million in Python 3.6.
00:00 Yeah, this is good also just for reading other people’s code, to understand how they’re using it.
00:00 Yeah, because you might just think, ‘Oh, that’s a weird naming convention that that person used here.’ They’re trying to communicate something very specific by using underscore in all these circumstances. So, I think it’s good to bring it up for people who maybe come from other languages.
00:00 Yeah, especially people that come from Pearl because Pearl has a very special meaning for underscore so it’s good to know that it isn’t that.
00:00 You know, one of the things that I think is frustrating is there’s a super old PyPI package. It hasn’t been updated in 5 years and it’s got like that perfect name. There should be some kind of active project to inherit its name.
00:00 There was a PEP that came out on the 12th by Donald Stuft and it’s in draft status currently. I’m sure with a few tweaks by Python core people it will probably go through. It’s called, “PEP 541: Package Index Name Retention” and it’s basically just like what you said. Given that a package name in the index share a flat name space. Unique names are a finite resource so as the Package Index is growing and growing older, we need to have a way to deal with things like old packages that aren’t updated anymore. Somebody that just has squatted on it, that hasn’t done anything with it. There’s other various things or conflict. If two people want to legitimately use it, how do we resolve those conflicts.
00:00 Yeah, an example comes to mind for me is there’s a cool package for kind of a crummy problem called Suds, so consuming SOAP web services in Python is not pretty. SOAP is like a big, gnarly XML format for web services. It’s preceded or sometimes more popular within enterprises than JSON http services. There’s a cool package called Suds that has a really great API for consuming those on the client side but it’s kind of old and outdated and only supports Python 2.x. So, a guy whose username is Jerko created a project called suds-jerko. So, if you want to have in the package name in code is just suds, but it’s basically that thing upgraded for Python 3. He couldn’t use suds, even though it was not being maintained, it owned it on PyPI.
00:00 Yeah, that’s a tough one. And even if something’s not updated, it’s hard to know if people are using it or not.
00:00 of the stuff that was addressed in this proposal is things like checking out other uses of the word, like on CPAN or NPM or GitHub. So, if it’s a very popular GitHub repository, maybe we should let it go through or something. I actually wanted to put in a package named ‘expect’ and it was already used by someone else. Then I went out and looked and it was legitimate. There was a lot of other tools that were more like that tool than what I wanted to use, so it would have been bad for me to try and throw a hissy fit and grab that.
00:00 It’s interesting. It’s cool that they are proposing a way to look at prior art and existing things.
00:00 I’m actually a little surprised it didn’t already exist, but I’m glad it’s happening.
00:00 Yep, it’s definitely a good thing. We’re about to come up on 100,000 PyPI packages. The naming is starting to become something scarce, right.
00:00 did you ever read the William Gibson books like Neuromancer and some of these crazy, futuristic, cyber-punk type books?
00:00 I kind of feel like we’re living in this sort of sci-fi future in a weird way. There was the craziest article that I’ve seen in a while come out. I’ll just read you the title. “Hackers Downloaded US Government Climate Data and Stored it on European Servers as Trump was Being Inaugurated.”
00:00 here’s the deal. I’m going to try to make this as politically neutral as possible. I just want to talk about the data side of this in a pretty interesting way. Who owns and controls data? Does the President? Does the country? Does the world own this data? The US has an incredible amount of data about the world, the Environmental Protection Agency (EPA), NASA, the Department of Energy and so on. There have been some rumors that when Trump took office, he was going to reduce, cutback or even fire some people that worked on climate change. People knew this. A few months before, maybe a month before, there were people looking around saying, these websites, like the EPA, have a bunch of data on it, and they were like, ‘What if that gets deleted?’ There was a meeting of 60 programmers and scientists at the Department of Information Studies at UCLA and their goal was to find as many of these websites with government data on it and start to scrap it and download it and exfiltrate it from the country. Isn’t that crazy?
00:00 what they did is, they all got together and they were working like 22 hours a day for the 4 or 5 days leading up to the inauguration. They had a huge spreadsheet of hundreds of sites where they’d go in and they’d try to get the data and they would pull it off. Examples include web pages dedicated to Department of Energy solar projects and initiatives or the Energy Information Administration stuff about fossil fuels compared to renewable energy and fuel cell research and those kinds of things. It turns out there’s what they call volunteer data rescue events in Toronto and Philly and Chicago, Indianapolis and Michigan, over the last weeks, trying to scrap hundreds of thousands of pages off the internet.
00:00 might say, ‘People are being crazy. C’mon.’ But it turned out, exactly at noon as Trump was sworn in and the UCLA group was actively working, all the climate change related data on the whitehouse.gov disappeared. But they had gotten a lot of it.
00:00 there’s this company called Page Freezer, and I think they’re a Canadian company. They basically create snapshots of webpages. A little bit like Internet Archive or the Wayback Machine, and it’s actually a commercial service. These guys started to participate and they’re storing all of this data somewhere in Europe.
00:00 The reason for storing it in Europe being there’s not a way the government can force it to shut down?
00:00 Yeah, exactly. I thought that was the craziest headline around programming I’ve seen this week. I wanted to include that.
00:00 I do know that there’s and again, not to be too political on this, aren’t there Legacy versions of the old versions of the White House website?
00:00 Probably on the Wayback Machine and stuff but it could be that it doesn’t necessarily store the data linked in it, like the 100 MB CSV file or whatever. That was the kind of stuff that they were actually going after.
00:00 Well, it’s good to have it saved.
00:00 It’s good to have it saved so we’ll see where it goes, but I just thought this was really interesting and I’m sure Python played a pretty key role in a lot of screen-scraping initiatives, with Beautiful Soup and Scrapy and things like that.
00:00 I think we’re going to leave it there for this week. That’s a lot of cool news for everyone.
00:00 Definitely cool. Well, thanks for talking to me this week.
00:00 You bet. Thanks for sharing stories and I’ll catch you next time.
00:00 you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes and get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbyes.fm and send it our way. We’re always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.