Transcript for Episode #48:
Garbage collection and memory management in Python
Michael KENNEDY: Hello and welcome to Python Bytes, where we deliver news and headlines directly to your earbuds. This is Episode #48, recorded October 18th, 2017. I’m Michael Kennedy.
Brian OKKEN: And I’m Brian Okken.
KENNEDY: We’ve got a bunch of awesome stuff lined up for you. We’re both dialing in from Portland, Oregon. We’ve scoured the Internet and we’re going to start with some graphs. But before we do, let’s just say really quick, a thanks to DigitalOcean.
OKKEN: A big thanks to DigitalOcean.
KENNEDY: They totally blew S3 out of the water and they’ve got an awesome thing called Spaces. We’ll tell you more about it later. Right now I want to hear about cool graphs.
OKKEN: I came across this last week. A website called python-graph-gallery.com, the Python Graph Gallery, and it is cool. I was describing it as graph examples x (times) your head explodes with options. It’s got all the different types of graphs you want to do. There are all sorts of graphs that you see around the Internet to help you visualize your data.
KENNEDY: You’ve got kind of the standard ones like histogram and stuff, or connected graphs or correlograms. Yeah, there’s amazing stuff here. And they all come with IPython scripts, right? You click on them and get the details.
OKKEN: You dive down into exactly what you want to do and then you can go in and it shows you exactly how to make those plots in Matplotlib and, I think, in IPython, but that’s the same thing, right?
But also, they’ll explain how to do something and they’ll have alternates and reasons. And there’s some opinion there, some of the graphs they don’t really like and they’ll tell you why they don’t like them and what some good alternatives are.
KENNEDY: Yeah, another thing that’s cool about it, you go to one graphic, ‘Huh, I think I need a bar chart’ or something like that. They pull up the related ones. ‘Oh, this one is way cooler. I didn’t even know about it. Maybe I haven’t read the Tufte ‘Visualizing Information’ book. I don’t know all of the options.’ You can discover them. I like that.
OKKEN: Yeah. And it includes some of the extensions. I just dove into ‘Seeing How to Do a Vertical Histogram’ and it mentions that you need to have the Seaborn library and use it for these.
KENNEDY: Looks pretty cool. And I guess there was some R ones out there. An R part that’s tied to it somehow as well if you do R. I’ve been thinking a lot about doing some stuff recently that would require some cool, interactive graphs, so this definitely catches my interest. So, check out the Python Graph Gallery. That’s cool.
Moving onto the next one… Brian, do you know what Kinesis streams are?
OKKEN: I don’t. I do have a Kinesis keyboard but I don’t think that’s related at all.
KENNEDY: Those keyboards are wild. I have the sculpt ergonomic mini thing from Microsoft. I used to have one of those.
But Kinesis streams are these things that AWS released. The idea is you can stream tons of real-time data through it and apply filters and transformations and get additional real-time insight. Under the description it will say things like, ‘You can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as web clickstreams, financial transactions, social media feeds, IT logs and location-tracking events.’
This sounds like a really cool service. You can sign up for AWS. It looks like – at least the folks that sent in this recommendation say – it really requires Java right now for the API to do it, so they felt that that was wrong. So, they created this thing called pynesis for Python APIs talking to Kinesis streams. How about that?
OKKEN: That’s great.
KENNEDY: Yeah, so if you’re out there and you’ve got tons of data streaming in – especially if you’re already an AWS customer, you already have an account and you already work there, maybe your apps run there – then it’s really cool. This library does some cool stuff. It works for 2.7 and 3.6. It has a Django extension helper and it automatically detects shard changes. This thing can do sharding and it will adjust for that. It will create checkpoints and even has a dummy Kinesis implementation for testing. How about that?
OKKEN: That’s great. And this is an Open Source project, too so you can extend on it if you need to.
KENNEDY: Right on, yeah. It’s pretty new but check it out. Thanks for pynesis. I forgot the guy’s name who sent it over, but thank you. It’s awesome.
So, one of the more mysterious things I think in Python, relative to other languages like C for example, is how memory works. In C, I call malloc or I call free. In Python I just do stuff and I never run out of memory. That’s kind of cool.
OKKEN: Yeah. It is cool, but it has some down sides. A little bit, I guess. Not really.
KENNEDY: At least some complexity, right?
OKKEN: Yeah, well it hides that complexity from the users. Especially when you have an application or a service or something that’s a long-running Python application, you kind of have to care about what’s going on and make sure that you don’t continually grow in memory.
There’s an article that we’re going to link to called, “Things You Need to Know About Garbage Collection in Python” and it just came out recently. I sat down with a cup of coffee this morning and I really read it and tried to grok it. And I think it helped me a lot to understand.
There’s two levels of garbage collection. There’s the automatic stuff: if an object goes out of scope and disappears, then Python can reclaim that memory. And there’s something about, it treats small objects under 512 bytes a little different to save time and that’s cool.
But then there’s this other thing that detects loops and other dead memory because in reference counting you can have objects point to each other and you can get these loops of memory that just sit around forever. So, there’s this other system, the generational garbage collector (GC), that goes through and looks for all of these dead items and cleans them out. That runs periodically, but that one, you can control if you need to, if you really can’t handle it going off and doing its own thing. You can turn it off and call it yourself once in a while if you need to.
KENNEDY: What’s really interesting is one of the benefits of C or C++ really, is total deterministic behavior. But the drawback is you’ve got to manage it manually. With reference counting, you get, also, totally deterministic behavior. You run it many times, it’s going to behave the same way exactly. So if you’re doing something where timing really matters, that’s cool. The reference counting GCs, or reference counting algorithm, has the problem of cycles. So, if I had a parent-child relationship, they’re always going to have at least one reference, because parent knows the child, child knows the parent. That thing’s never going to go to zero and will leak, so you have this secondary mark-and-sweep garbage collector-type thing that comes in. I think it’s really interesting how they’ve chosen this combination. The mark-and-sweep garbage collector is similar to .NET or Java, which, that’s all they have over there, right?
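A minimal sketch of the parent-child cycle described above, not taken from the episode or the article. It assumes CPython (other interpreters reclaim memory differently), and the Node class and build_family function are hypothetical names used only for illustration. A weak reference is used to observe whether an object is still alive without keeping it alive.

```python
import gc
import weakref


class Node:
    """Hypothetical parent/child object used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


def build_family():
    """Create a parent and a child that reference each other (a cycle)."""
    parent = Node("parent")
    child = Node("child")
    parent.children.append(child)  # parent -> child
    child.parent = parent          # child -> parent
    return parent


gc.disable()  # keep the automatic cycle collector out of the way for the demo

# No cycle: dropping the last reference frees the object immediately.
loner = Node("loner")
loner_alive = weakref.ref(loner)
del loner
freed_by_refcount = loner_alive() is None

# Cycle: the pair keeps each other alive after our own reference is gone.
family = build_family()
family_alive = weakref.ref(family)
del family
survived_refcount = family_alive() is not None

# The cycle collector finds the unreachable pair and reclaims it.
gc.collect()
freed_by_gc = family_alive() is None

gc.enable()
print(freed_by_refcount, survived_refcount, freed_by_gc)
```

On CPython the three flags should all come out true: reference counting handles the simple case on its own, and only the cycle needs the second collector.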
OKKEN: I don’t know.
KENNEDY: Those two basically work in this generational garbage collector way, very similar – I don’t know if it’s exactly the same – but it’s similar for Java and .NET, but in Python that’s not the main way it works. But that’s actually pretty interesting.
OKKEN: The article here doesn’t go into too much depth, but deep enough to where you can understand it. I knew that you could mess around with stopping the garbage collector or the generational one and controlling that yourself, but I didn’t know how to do it. It’s really not that complicated. It’s a few lines of code is all.
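A minimal sketch of the "few lines of code" idea, using only the standard library gc module. This is not the article's code; run_without_auto_gc and make_cycles are hypothetical helper names, and the workload is a stand-in that deliberately creates reference cycles.

```python
import gc


def run_without_auto_gc(work):
    """Run work() with automatic cycle collection paused, then collect once."""
    was_enabled = gc.isenabled()
    gc.disable()                     # reference counting still runs as usual
    try:
        result = work()
    finally:
        unreachable = gc.collect()   # one full, manually triggered collection
        if was_enabled:
            gc.enable()              # restore the previous setting
    return result, unreachable


def make_cycles(n=1000):
    """Stand-in workload: build n pairs of dicts that point at each other."""
    for _ in range(n):
        a = {}
        b = {"other": a}
        a["other"] = b
    return n


# The thresholds that decide when the automatic collector kicks in.
thresholds = gc.get_threshold()

result, unreachable = run_without_auto_gc(make_cycles)
print("thresholds:", thresholds)
print("cycles built:", result, "unreachable objects found:", unreachable)
```

The same pattern works for calling gc.collect() at moments you choose, for example between requests or batches, instead of letting it run whenever the thresholds are crossed.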
KENNEDY: Yeah, there’s a couple of neat things about this article. One is, there are some very nice specifics, like did you know that objects that are equal to or smaller than 512 bytes have a different allocator and mechanism? Knowing that cutoff and those sorts of things, knowing when the GC kicks in and when to turn it off. There’s also a lot of references, like ‘If you don’t know more about this, read this section.’ I think this is a great place to start this exploration.
OKKEN: And at the end it talks about how to find these cycles that are bad. You kind of want to get those out of your code if you really want to care about this a lot. And it talks about how to go looking for that stuff and visualizing it so you can try to find these cycles in your code and get rid of them. It’s cool.
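One way to go looking for cycles with the standard library alone, sketched here as an assumption rather than the article's method: the gc.DEBUG_SAVEALL flag tells the collector to keep everything it finds unreachable in gc.garbage instead of freeing it, so you can inspect what was caught. The Leaky class and the helper functions are hypothetical.

```python
import gc


class Leaky:
    """Hypothetical class whose instances refer to themselves (a cycle)."""

    def __init__(self):
        self.me = self


def find_cyclic_garbage(make_garbage):
    """Return the objects that only the cycle collector was able to find."""
    gc.collect()                    # start from a clean slate
    gc.set_debug(gc.DEBUG_SAVEALL)  # save unreachable objects in gc.garbage
    try:
        make_garbage()
        gc.collect()
        found = list(gc.garbage)
    finally:
        gc.set_debug(0)             # back to normal behavior
        gc.garbage.clear()
    return found


def make_garbage():
    """Create three objects that reference counting alone cannot free."""
    for _ in range(3):
        Leaky()


found = find_cyclic_garbage(make_garbage)
leaky = [obj for obj in found if isinstance(obj, Leaky)]
print(len(leaky), "Leaky instances were part of reference cycles")
```

Grouping the saved objects by type is often enough to point at the class that needs a weak reference or an explicit cleanup step.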
KENNEDY: Yeah. The other thing to consider when you’re thinking about this stuff, especially if it kicks in to the actual mark-and-sweep cycle garbage collector-type thing, is algorithms and data structures. So, you can have a data structure that is like many objects that point at each other. Think of like linked list type of things. There’s tons of work to process those if you’ve got ginormous ones. Tons of work to process and determine if that’s garbage, right? You might be able to use a sparse array or something that uses almost no pointers but stores the same data and is more efficient. There’s a lot of interesting things to follow and explore here.
OKKEN: Yeah. This is mostly a concern with people who have long running Python applications. For short running things it’s not a problem. You don’t really have to care about it.
KENNEDY: Also, another final thought is you said you can turn off the garbage collector. I think, was it Instagram that turned off the garbage collector in their system? I feel like it was Instagram or Quora, one of these people. One of those companies turned off the garbage collector and they were able to get much better memory use on Linux across the processes, and it actually was better off letting the cycles leak.
OKKEN: In this article you can determine it yourself. You can have pre-determined times where you’re going to go out and let it run.
KENNEDY: Yep. Pretty interesting.
You know what else is interesting? Spaces.
OKKEN: (Laughs) Yes.
KENNEDY: Spaces is pretty awesome. This audio that you guys are listening to came over DigitalOcean’s Spaces. If you’re familiar with S3, this is like S3 but way better. So, very deterministic pricing, you pay $5 for a terabyte of outbound traffic, no inbound traffic, and beyond that it’s like 1/9th the price of bandwidth and traffic for S3. So if you’re using S3 now, definitely consider Spaces. They’re doing really cool stuff there, all the APIs, all the libraries and the tools that work in S3 also work in Spaces. They’ve made that sort of a compatibility layer for them.
I’ve been using it. I really, really like it and I definitely encourage you to check it out at do.co/python. Help support the show and, like I said, it’s pretty awesome.
Let’s talk about the web for a little bit. Many times we touched on asynchronous programming of one variety or another, threads, multiprocessing, asyncio type of things. But the truth is that on the web, almost all of the things, all the frameworks, are built in a way that cannot take advantage of that at all, or very rarely, because they’re built upon WSGI (Web Server Gateway Interface). And that basically has a single serial function call for each request and that’s that. There’s really not much of a way to expand or to change how web processing works. So, if you want to do some async and await on database calls or against web services, you can do that with requests, for example. That’s basically not going to have any effect. There’s still going to be blocking somewhere along in this WSGI request. There’s no way for the server to take advantage of that. Some of the servers use threads, like uWSGI, but it’s not nearly the same level of benefit.
So there’s this article – or series, I guess – to come out here called, “WSGI is Not Enough Anymore.” I’m referencing part 1 and part 2. Part 1 really lays out the problem. Basically, there are 2 problems. One is concurrency, which I just described. The other problem is that HTTP isn’t the only protocol anymore. So, things like web sockets and other multi/bi-directional communication, binary stuff is happening. That’s also not supported by WSGI. This article and series explores how we solve this with event-driven programming and they’re not quite done, they’re still working on it. I thought it was a cool thing. So, the next session or next thing that’s coming out is talking about libraries to solve the concurrency problem in Python and then onwards to the other things.
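To make the "single serial function call for each request" point concrete, here is a minimal sketch of a WSGI application, not taken from the article series. The app function follows the standard WSGI callable signature; call_app is a hypothetical helper that plays the role of the server, using wsgiref.util.setup_testing_defaults from the standard library to fake a request.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A WSGI application: one synchronous call per HTTP request."""
    body = f"Hello from {environ['PATH_INFO']}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    # Any slow work done before this return blocks the worker handling it.
    return [body]


def call_app(path):
    """Invoke the app the way a server would, with a fake request environ."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fill in the other required WSGI keys
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = app(environ, start_response)
    return captured["status"], b"".join(chunks)


status, body = call_app("/episode/48")
print(status, body)
```

The whole request lives inside that one call and its return value, which is why there is no natural place in this interface for an await or for a long-lived two-way connection.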
OKKEN: Wow, that’s very interesting.
KENNEDY: I can’t wait for the day when these things really unlock because we talk about things like async and await. They’re pretty but really hard to make practical use of. Once the web server requests themselves can participate in these async event loops, then it’s on. It just breaks open and all sorts of amazing stuff can happen.
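A small sketch of what request handlers participating in an event loop could look like, using only asyncio from the standard library. This is an illustration under stated assumptions, not a real web server: fake_db_call, handle_request and serve_all are hypothetical names, and asyncio.sleep stands in for a database or web service call.

```python
import asyncio
import time


async def fake_db_call(delay):
    """Stand-in for a slow database or web service call."""
    await asyncio.sleep(delay)  # yields to the event loop while waiting
    return delay


async def handle_request(name):
    """Hypothetical request handler that awaits its slow dependency."""
    await fake_db_call(0.1)
    return f"{name} done"


async def serve_all(names):
    """Run every handler on the same event loop, overlapping the waits."""
    return await asyncio.gather(*(handle_request(n) for n in names))


start = time.perf_counter()
results = asyncio.run(serve_all(["a", "b", "c"]))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f} s for three 0.1 s waits")
```

Because the three waits overlap, the elapsed time should land much closer to one wait than to three, which is the benefit a WSGI handler cannot get from await on its own.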
OKKEN: I guess I didn’t realize these frameworks couldn’t take advantage of web sockets. Or can they with add-on libraries or something?
KENNEDY: Yeah. You’ve got to set up some kind of separate server. I can’t remember what it’s called, unfortunately, but it can send it over. Like, ‘We’re going to upgrade this to a socket so send it over to this separate process, separate server-type thing.’ It’s a lot of work to juggle these different protocols right now. It will be nice when that’s more seamless.
OKKEN: I’ll have to follow along with these, this is great.
KENNEDY: And for now, we can use things like queues even, for an asynchronous concurrency. Drop off a little job and pick it back up.
OKKEN: I was looking for a queue, a last in, first out (LIFO) queue. I needed that for a project I was working on. I just needed it as a data structure. I didn’t have different producers and consumers, I didn’t have one part of the program where I was collecting stuff and another part where I had to get it out, last in, first out.
So, I was looking around and there was an article from Dan Bader and it’s called, “Queues in Python.” I guess I’ve just forgotten about a lot of this stuff. It goes over how to use queues in Python: how to use a list, how to use the queue library (there’s actually a built-in library), and collections.deque is also something you can use. The deque is a doubly-linked list. And it talks about pretty much how to use them.
It’s a pretty good article and it mentions that you can use all of these for last in, first out. I didn't quite know how to use those so I went ahead and explored a way to use these three for a last in, first out queue and threw it in the show notes.
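A minimal sketch of last in, first out with the three options mentioned: a plain list, collections.deque and queue.LifoQueue. This is not the code from the show notes; the function names are hypothetical and each one simply pushes every item and then pops them all back off.

```python
from collections import deque
from queue import LifoQueue


def lifo_with_list(items):
    """Use a plain list as a stack: append to push, pop to take the newest."""
    stack = []
    for item in items:
        stack.append(item)
    return [stack.pop() for _ in range(len(stack))]


def lifo_with_deque(items):
    """Same interface on a deque: append and pop both work on the right end."""
    stack = deque()
    for item in items:
        stack.append(item)
    return [stack.pop() for _ in range(len(stack))]


def lifo_with_lifoqueue(items):
    """queue.LifoQueue adds locking, which matters mostly across threads."""
    stack = LifoQueue()
    for item in items:
        stack.put(item)
    out = []
    while not stack.empty():
        out.append(stack.get())
    return out


data = ["first", "second", "third"]
print(lifo_with_list(data))
print(lifo_with_deque(data))
print(lifo_with_lifoqueue(data))
```

All three return the items newest first. For a single-threaded data structure the list and the deque are the lighter choices, since LifoQueue is designed for handing work between threads.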
KENNEDY: Yeah, that’s really cool and really simple. I think, knowing about data structures and especially knowing about the built-in ones is really valuable. I feel like we’ve been doing Python for a long time, but I still continue to constantly learn about these things.
OKKEN: It’s good to come back when you start using the data structures you’re using all the time and you need something else. Going ahead and looking what’s around is neat.
I was also curious about timing, so I went ahead and wrote a sample program and timed all these with some huge objects I was throwing in there to see if any of them were faster or slower. With small objects they’re all kind of about the same and with large objects it looks like the collections.deque is a tad bit faster for my use, but none of them are really out of the ballpark slower. To me, the deque has the best interface because you can iterate over it and it looks cleaner. But that’s my opinion.
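A rough sketch of how such a timing comparison could be set up with timeit from the standard library. This is not the sample program mentioned above: the payload size, repeat counts and function names are all assumptions, and the numbers will vary by machine and Python version, so no expected results are shown.

```python
import timeit
from collections import deque
from functools import partial
from queue import LifoQueue


def churn_list(payloads):
    """Push every payload onto a list, then pop them all."""
    stack = []
    for item in payloads:
        stack.append(item)
    while stack:
        stack.pop()


def churn_deque(payloads):
    """Push every payload onto a deque, then pop them all."""
    stack = deque()
    for item in payloads:
        stack.append(item)
    while stack:
        stack.pop()


def churn_lifoqueue(payloads):
    """Push every payload onto a LifoQueue, then get them all."""
    stack = LifoQueue()
    for item in payloads:
        stack.put(item)
    while not stack.empty():
        stack.get()


# Assumed workload: 1,000 payloads of 10,000 bytes each. The containers
# store references, so payload size may matter less than the item count.
payloads = [bytes(10_000) for _ in range(1_000)]

candidates = {
    "list": churn_list,
    "deque": churn_deque,
    "LifoQueue": churn_lifoqueue,
}

# Take the best of three runs of 20 passes each to reduce noise.
timings = {
    name: min(timeit.repeat(partial(fn, payloads), number=20, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>9}: {seconds:.4f} s")
```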
KENNEDY: Yeah, that’s really cool. Thanks for pointing that out.
I want to close this out with something kind of meta on our podcast. I want to talk about a new podcast. A guy named Mark Weiss created a podcast called Using Reflection: A Podcast About Humans Engineering. So, he started out interviewing Jesse Davis from MongoDB and one of the main Python guys in the space. There’s a really cool interview with him and if you are thinking about looking at these notable people and how they’ve become leaders within their companies or within the industry and you want to explore that journey with them, this is a pretty cool podcast. I thought I’d give a shout out to it.
OKKEN: I listened to a couple episodes and I like his interview style. It’s very conversational and laid back. It’s cool.
KENNEDY: Yeah, it’s like you just kick back and grab a coffee with the two guys and you just don’t say anything because they can’t hear you. (Laughs) Or you can say something but they still won’t hear you.
OKKEN: (Laughs) Yeah.
KENNEDY: Check out Using Reflection, it’s a cool podcast.
So, I guess that’s it for our news this week, Brian. Anything else you want to share with the people?
OKKEN: I got nothing this week.
KENNEDY: No more book writing? You’re just hanging out at the zoo now, huh?
OKKEN: That was fun. If your idea of fun is trying to herd six eight-year-olds around a zoo for a day, then it was fun.
KENNEDY: Give me a tricky bug, I’ll take that instead. (Laughs)
So, last week I announced my free MongoDB course at freemongodbcourse.com and that thing has been going super well. Over 5,000 people have taken that course in a week. That’s pretty amazing.
OKKEN: I have to admit that I was doing your longer Mongo course and I thought I’d watch this first. I started it myself. I’m one of those sign-ups.
KENNEDY: Cool, you are like, I don’t know what that percent is. (Laughs) Cool. Very nice. People seem to be enjoying it so I’m glad that everyone could take advantage of it.
OKKEN: I’m glad you put that out there. It’s really cool. People should check it out.
KENNEDY: Thanks. Alright, well, I guess until next week, Brian.
OKKEN: Yeah, talk to you next week.
KENNEDY: Talk to you next week.
Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes. Get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We’re always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.