Transcript #48: Garbage collection and memory management in Python
Return to episode page view on github00:00 Hello and welcome to Python bites where we deliver Python news and headlines directly to your earbuds This is episode 48 recorded October 18th 2017 I'm Michael Kennedy and I'm Brian Harkin and we got a bunch of awesome stuff lined up for you we're both dialing in from Portland, Oregon, we've scoured the internet and We're gonna start with some graphs But before we do let's just say really quick a thanks to digital ocean a big thanks to digital ocean. Thanks Yeah, they totally blew s3 out of the water and they've been an awesome thing called spaces We'll tell you more about it later.
00:30 Right now I want to hear about cool graphs.
00:32 - Well, I came across this last week a website called Python-Graph-Gallery, the Python Graph Gallery.
00:40 And it is cool.
00:42 I was describing it as graph examples times your head explodes with options.
00:50 But got a whole bunch of different types of graphs you want to do.
00:54 There are all sorts of different types of graphs that you see around the internet and basically to help you visualize your data.
01:00 >> You got the standard ones like histogram and stuff or box plot, but then you also have really cool ones like 2D density plots or bubble plots, or connected graphs, or core.
01:11 Core-logram, yeah, there's amazing stuff here.
01:14 They all come with IPython little scripts, right?
01:19 You click on them, they get the details.
01:20 >> You dive down into exactly what you want to do, and then you can go in and it shows you exactly how to make those plots within Matplotlib and in I think in IPython, but that's the same thing, right?
01:33 But then also some of them have, like they'll explain how to do something and then they have alternates and there's some opinion there.
01:41 Some of the graphs they don't really like and they'll tell you why they don't like them and what some good alternatives are.
01:46 - Yeah, another thing that's cool about it is you go to one graph, you're like, huh, I think I need a bar chart or something like that and it pulls up and says, these are the related ones.
01:54 And you're like, oh, this one is way cooler.
01:55 I didn't even know about it.
01:56 Like I haven't, maybe I haven't read the Tufte like visualizing information book.
02:01 And I don't know of all the options, right?
02:02 And you can discover them.
02:03 I like that.
02:04 - Yeah, and then like it includes some of the extensions.
02:07 Like I just dove into seeing how to do a vertical histogram and it mentions that you need to have the Seaborn library and use it for these, so.
02:17 - Yeah, it looks pretty cool.
02:18 - It's great.
02:19 - Yeah, and I guess there's also some R ones out there like an R part that's sort of tied to it somehow as well if you do R.
02:25 But yeah, I've been thinking a lot about doing some stuff recently that would require some really cool interactive graphs.
02:31 So this definitely catches my interest.
02:33 - Yeah.
02:34 - All right, so check out the Python graph gallery.
02:36 That's cool.
02:37 Moving on to the next one.
02:38 Brian, do you know what Kinesis Streams are?
02:40 - I don't.
02:41 I do have a Kinesis keyboard, but I don't think that's related at all.
02:43 - Kinesis keyboards are wild, man.
02:45 I have the Sculpt Economic Mini thing from Microsoft.
02:49 I used to have one of those.
02:50 But Kinesis Streams are these things that AWS released.
02:53 And the idea is you can like stream tons of real time data through it and apply filters and transformations and get additional sort of real time insight.
03:03 So like under the description, it'll say things like you can continuously capture and store terabytes of data per hour from hundreds of thousands of sources, such as web click streams, financial transactions, social media feeds, et cetera, et cetera.
03:17 So this sounds like a really cool service, you can go sign up for it, AWS.
03:20 It looks like at least the folks that sent in this recommendation say, look, it really requires Java right now for the API to do it. So they felt that that was wrong. And they created this thing called Pionesis for Python, API is talking to Kinesis streams. How about that? That's great. Yeah. So if you're out there, and you got tons of data streaming in, and especially if you're already an AWS customer, you already have an account, you already work there, maybe your apps run there, then it's really cool. So this library does some cool stuff. It's worked for 27 and 36. It has a Django extension helper, it automatically detects shard changes to like this thing can do sharding and like adjust for that it'll create checkpoints and even has a dummy kinesis implementation for testing.
04:04 How about that?
04:05 That's great.
04:06 That's cool.
04:07 Yeah, and it's a this is an open source project too.
04:09 So you can extend on it if you need to write on.
04:11 Yeah, it's pretty new, but check it out.
04:13 And thanks for recommending pinesis.
04:15 I forgot the guy who sent it in over Twitter.
04:17 But yeah, thank you.
04:18 That's awesome.
04:19 more mysterious things I think in Python relative to say other languages like C for example is how memory works right like in C I call malloc or I call free in Python I just do stuff and like I never run out of memory that's kind of cool yeah it is cool but it has some downsides a little bit I guess not really at least some complexity right yeah well it hides that complexity from the users but there's especially when you have an application or a service or something that's a long running Python application, you have to care about what's going on and make sure that you don't continually grow in memory.
04:57 There's an article that we're going to link to called Things You Need to Know About Garbage Collection in Python.
05:03 It just came out recently and I sat down with a cup of coffee this morning and really read it and tried to grok it.
05:09 I think it helped me a lot to understand how Python does.
05:13 There's two levels of garbage collection.
05:15 There's the automatic stuff that's just, If an object goes out of scope, it disappears and then the Python can reclaim that memory.
05:24 And there's something about like it treats small objects like under 512 bytes a little different to save time.
05:31 And that's cool.
05:32 But then there's this other thing to detect loops and other dead memory because reference counting you can have two objects point to each other and you can get these loops of memory that just sit around forever.
05:44 And so there's this other system, the generational garbage collector that goes through and looks for all of these dead items and cleans them out and that runs periodically.
05:56 But that one you can control if you need to. If you really can't handle it going off and doing its own thing, you can turn it off and call it yourself once in a while if you need to.
06:06 What's really interesting about this is one of the benefits of like C or C++ really is you get total deterministic behavior.
06:16 But the drawback is you got to manage it manually.
06:18 With reference counting, you get also totally deterministic behavior, right?
06:23 You run it many times, it's going to behave the same way exactly.
06:26 So if you're doing something as timing that really matters, that's cool.
06:29 The reference counting GCs or reference counting algorithm has the problem of cycles.
06:34 So if I have like a parent-child relationship, they're always going to have at least one reference because parent knows a child, child knows parent.
06:41 So that thing's never going to go to zero and will leak.
06:43 So you have this secondary like market sweep garbage collector type thing that comes in.
06:47 And I think it's really interesting how they've chosen like this combination and the market sweep garbage collector, similar to like .NET or Java, which is that's all they have over there, right?
06:58 I didn't know.
06:59 Yeah, yeah.
07:00 Those two basically work in this generational garbage collector way very similar.
07:02 I don't know that it's exactly the same, but it's similar for Java and .NET.
07:05 But that's not the main way it works.
07:08 So I think that that's actually pretty interesting.
07:11 I mean, the article here doesn't go into too much depth, but deep enough to where you can understand it.
07:15 And it's really, I thought, you know, I knew that you could mess around with stopping the garbage collector and, or the generational one, and controlling that yourself, but I didn't know how to do it.
07:27 And it's really not that complicated. It's a few lines of code is all.
07:29 Yeah, there's a couple of neat things about this article.
07:32 One is there are some very nice specifics, like did you know the five objects that are equal to or smaller than 512 bytes have a different allocator and mechanism, right?
07:43 Like knowing that cutoff and those sorts of things and knowing when the GC kicks in and when to turn off.
07:49 Like there's also a lot of references, like if you don't know more about this, read about this section.
07:52 You don't know more about this, read about this section.
07:55 So I think this is a great place to start this exploration.
07:57 - And then at the end, it talks about how to find these, you know, these cycles are bad And you kind of want to get those out of your code if you really want to care about this a lot.
08:05 And it talks about how to do, how to go about looking for that stuff and visualizing it so you can try to find these cycles in your code and get rid of them.
08:14 That's cool.
08:15 - Yeah, the other thing to consider when you're thinking about stuff, especially if it kicks into the actual market sweep cycle, garbage collector type thing, is algorithms and data structures.
08:25 So you can have a data structure that is like many, many objects that point at each other.
08:30 think of like linked list type of things, there's tons of work to process those if you got ginormous ones.
08:36 You got tons of work to process and determine if that's garbage, right?
08:40 You might be able to like use a sparse array or something that uses almost no pointers but stores the same data and you know is more efficient.
08:48 So there's a lot of interesting follow on things to explore here.
08:51 And again, yeah, and this is mostly a concern with people that have long running Python applications for short-running things it's not a problem so you don't really have to care about it. Yep. Also another final thought is you said you can turn off the garbage collector. I think was it Instagram that turned off the garbage collector in their system? It was either I feel like it was Instagram or Quora.
09:12 One of those people, one of those companies turned off the garbage collector and they found they were able to get much better memory use reuse on Linux across the processes and actually was better off by just letting the cycles leak in this article you can Determinate yourself you can have predetermined times where you're gonna go out and let it run. Yep pretty interesting You know what else is interesting?
09:34 spaces Yeah, it is space is pretty awesome. Yeah, so like this audio you guys all are listening to you came over Digital ocean spaces and if you're familiar with s3 this is like s3 but way better so Very deterministic pricing you pay $5 a month for a terabyte of outbound traffic. No inbound traffic and Beyond that it's like 1/9 the price of Bandwidth and traffic for s3. So if you're using s3 now definitely consider digital ocean spaces They're doing really cool stuff there all the API is the libraries and the tools that work against s3 also work against spaces They've made that sort of a compatibility layer for them So I've been using it really really like it and you know I definitely encourage you to check it out at do dot co slash Python help support the show And like I said, I think it's pretty awesome.
10:23 So let's talk about the web for a little bit.
10:25 We've, you know, many times we've touched on asynchronous programming of one variety and other threads, multi-processing, async I/O type of things.
10:35 But the truth is that on the web, almost all of the things, all the frameworks are built in a way that cannot take advantage of that at all, or very, very rarely, I guess, because they're built upon WSGI.
10:50 web service gateway interface. And that basically has a single serial function call for each request. And that's that there's really not much of a way to expand or to change how the web processing works. So like, if you want to do maybe some async and await on like database calls, or against web services, like requests, you could do that with requests, for example, that's basically not going to have any effects, it's still going to be blocking somewhere along in this WSGI request. There's no way for the server to take advantage of that.
11:23 Some of the servers use threads like micro WSGI, but still it's not nearly the same level of benefit. So there's this article I want to series, I guess, that's starting to come out here called WSGI is not enough anymore. I'm referencing part one and part two and part one really lays out the problem. Basically, there are two problems. One is concurrency, right, which I just described. The other problem really is that HTTP isn't the only protocol anymore. So things like web sockets and other multi bidirectional communication and binary stuff is happening. That's also not supported by WSGI.
11:57 So this article and series sort of explores how do we solve this with event-driven programming and they're going to... they're not quite done.
12:06 They're still working on it but I thought it was a cool thing. So the next session, the next thing that's coming out is talking about libraries to solve the a concurrency problem in Python and then onwards to the other things.
12:17 So pretty cool.
12:17 - Yeah, that's very interesting.
12:19 - Yeah, I can't wait for the day when these things really unlock, because we talk about things like async and await, and they're pretty cool, but they're really hard to make practical use of.
12:28 Like once the web server requests themselves can participate in these async event loops, then it's on, right?
12:36 Like it just breaks open and all sorts of amazing stuff can happen.
12:38 - So I guess I didn't realize that these frameworks couldn't take advantage of web sockets or can they with add-on libraries or something?
12:47 - Yeah, you've got to set up some kind of separate server, I can't remember what it's called unfortunately, but then it can send it over, say, you know, we're going to upgrade this to a socket, so send it over to the separate process, like the separate server type of thing.
13:01 There's a lot of work to juggle these different protocols right now.
13:05 So yeah, it'll be nice when that's more seamless.
13:07 - Well, I'll have to follow along with these, this is great.
13:09 And for now, we can use things like queues even for a little asynchronous concurrency, drop off a little job and pick it back up.
13:15 - I was in the looking for a queue, a last in, first out queue.
13:20 I needed that for a project I was working on.
13:23 I just needed it as a data structure.
13:24 I didn't have different producers and consumers.
13:27 I just had one part of the program where I was collecting stuff and another part where I had to get it out last in, first out.
13:35 So I was looking around and there was an article from Dan Bader and it's called queues in Python.
13:41 And it's a decent, I guess I'd just forgotten about a lot of this stuff.
13:45 And it kind of goes over lists, using how to use queues in Python and how to use them.
13:51 How to use a list, how to use the queue library.
13:54 There's actually a queue built-in library.
13:56 And the collections deck also is something you can use.
14:01 The deck is a doubly linked list.
14:03 And then it talks about pretty much how to use them.
14:06 And it's a pretty good article.
14:07 And it mentions that you can use all of these for last and first out, but I didn't, I didn't quite know how to use those.
14:14 So I went ahead and explored all the different ways to use these three, as or just a way to use these three as a last and first out queue, and threw it in, in the show notes.
14:25 So yeah, it's really cool, really simple.
14:27 I think, you know, knowing about data structures, and especially knowing about the built in ones is really valuable.
14:32 And I feel like we've been doing Python for a long time, but I still continuously learn about these things.
14:37 >> It's good to come back when you start using the data structures you're just using all the time and you need something else, going ahead and looking what's around is neat.
14:46 I was also curious about timing.
14:48 So I went ahead on a sample program and timed all these to see with like some huge objects I was throwing in there to see if any of them were faster or slower.
14:59 And with small objects, they're all kind of about the same.
15:03 with large objects, it looks like the the collections deck is a tad bit faster for my use, but none of them are really out of the ballpark slower. So to me, the deck has the best just the best interface because you can just iterate over it when it looks cleaner. But that was my opinion. Yeah, that's really cool. Thanks for for pointing that out. All right, I want to sort of close this out with something kind of meta. So on our podcast, I want to talk about a new podcast. So a guy named Mark Weiss created a podcast called using reflection, a podcast about humans and engineering. So he started out interviewing Jesse Davis from MongoDB, one of the main Python guys in the space. So there's a really cool interview about him.
15:50 And if you're thinking about you want to look at these notable people on how they've become leaders within their companies or within their industry, and you want to sort of explore that journey with him. It's a pretty cool podcast so I thought I'd give a shout out to it.
16:02 I listened to a couple episodes and I like his interview style and it's very conversational and laid back. It's cool.
16:09 Yeah, it's like you just kick back, grab a coffee with the two guys and you just don't say anything because they can't hear you.
16:14 Yeah.
16:15 You can say stuff but they still don't hear you.
16:16 Yeah.
16:17 Yeah, who knows? That's awesome. All right, so yeah, check out Using Reflection. It's a cool podcast. All right, so I guess that's it for our news this week, Brian. Anything else you got you want to share with the people?
16:28 I got nothing this week.
16:29 No more book writing, just hanging out at the zoo now, huh?
16:32 That was fun.
16:32 If your idea of fun is trying to herd six eight-year-olds around a zoo for a day, then it was fun.
16:38 Give me a tricky bug.
16:39 I'll take that instead.
16:42 Yeah, so last week I talked about I announced my free MongoDB course at freemongodbcourse.com.
16:49 And that thing has been going super well.
16:51 Over 5,000 people have taken that course in a week.
16:53 That's pretty amazing.
16:54 I have to admit that I was doing your longer Mongo course, and I thought I'd watch this first, so I've started it myself.
17:03 I'm one of those sign-ups.
17:05 - Oh, cool, you're like, I don't know what that percent is.
17:09 So, cool, very nice.
17:11 Yeah, people seem to be enjoying it, so I'm glad everyone could take advantage of it.
17:14 - I'm glad you put that out there.
17:15 It's really cool, and people should check it out.
17:17 - Yep, thanks.
17:19 All right, well, I guess until next week, Brian.
17:21 - Yeah, talk to you next week.
17:22 - All right, talk to you next week.
17:24 Thank you for listening to Python Bytes.
17:27 Follow the show on Twitter via @pythonbytes, that's Python Bytes as in B-Y-T-E-S.
17:33 And get the full show notes at pythonbytes.fm.
17:36 If you have a news item you want featured, just visit pythonbytes.fm and send it our way.
17:41 We're always on the lookout for sharing something cool.
17:43 On behalf of myself and Brian Okken, this is Michael Kennedy.
17:46 Thank you for listening and sharing this podcast with your friends and colleagues.