Transcript #48: Garbage collection and memory management in Python
00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
00:05 This is episode 48, recorded October 18th, 2017.
00:09 I'm Michael Kennedy.
00:10 And I'm Brian Okken.
00:11 And we've got a bunch of awesome stuff lined up for you.
00:14 We're both dialing in from Portland, Oregon.
00:16 We've scoured the internet, and we're going to start with some graphs.
00:20 But before we do, let's just say really quick a thanks to DigitalOcean.
00:23 A big thanks to DigitalOcean. Thanks.
00:25 Yeah, they totally blew S3 out of the water, and they've got an awesome thing called Spaces.
00:29 We'll tell you more about it later.
00:30 Right now, I want to hear about cool graphs.
00:32 Well, last week I came across a website called python-graph-gallery, the Python Graph Gallery.
00:40 And it is cool.
00:42 I was describing it as graph examples times your head explodes with options.
00:49 But you've got a whole bunch of different types of graphs you want to do.
00:53 There are all sorts of different types of graphs that you see around the internet.
00:58 Basically, to help you visualize your data.
01:00 You've got kind of the standard ones, like histogram and stuff, or box plot.
01:04 But then you also have really cool ones like 2D density plots, or bubble plots, or connected graphs, or correlograms.
01:12 Yeah, there's amazing stuff here.
01:14 And they all come with little IPython scripts, right?
01:18 If you click on them, you get the details.
01:20 You dive down into exactly what you want to do.
01:23 And then you can go in, and it shows you exactly how to make those plots within Matplotlib and, I think, in IPython.
01:31 But that's the same thing, right?
01:33 But then also, some of them have, like, they'll explain how to do something, and then they have alternates.
01:39 And there's some opinion there.
01:41 Some of the graphs they don't really like.
01:42 And they'll tell you why they don't like them and what some good alternatives are.
01:46 Yeah.
01:47 Another thing that's cool about it is you go to one graph.
01:49 You're like, huh, I think I need a bar chart or something like that.
01:52 And they pull it up.
01:53 It says, these are the related ones?
01:54 And you're like, oh, this one is way cooler.
01:55 I didn't even know about it.
01:56 Maybe I haven't read the Tufte visualizing information book, and I don't know of all the options, right?
02:02 And you can discover them.
02:03 I like that.
02:04 Yeah.
02:04 And then, like, it includes some of the extensions.
02:06 Like, I just dove into seeing how to do a vertical histogram.
02:10 And it mentions that you need to have the Seaborn library and use it for these.
02:16 Yeah.
02:17 It looks pretty cool.
02:18 It's great.
02:18 Yeah.
02:19 And I guess there's also some R ones out there, like an R part that's sort of tied to it somehow as well if you do R.
02:25 But, yeah, I've been thinking a lot about doing some stuff recently that would require some really cool interactive graphs.
02:31 So this definitely catches my interest.
02:33 Yeah.
02:33 All right.
02:33 So check out the Python graph gallery.
02:36 That's cool.
02:36 Moving on to the next one.
02:38 Brian, do you know what Kinesis streams are?
02:40 I don't.
02:40 I do have a Kinesis keyboard, but I don't think that's related at all.
02:43 Kinesis keyboards are wild, man.
02:45 I have the Sculpt Ergonomic mini thing from Microsoft.
02:48 I used to have one of those.
02:49 But Kinesis streams are these things that AWS released.
02:53 And the idea is you can, like, stream tons of real-time data through it and apply filters and transformations and get additional sort of real-time insights.
03:03 So, like, under the description, it'll say things like, you can continuously capture and store terabytes of data per hour from hundreds of thousands of sources, such as web click streams, financial transactions, social media feeds, et cetera, et cetera.
03:17 So this sounds like a really cool service.
03:18 You can go sign up for it, AWS.
03:20 It looks like, at least the folks that sent in this recommendation say, look, it really requires Java right now for the API to do it.
03:27 So they felt that that was wrong.
03:29 And they created this thing called pynesis, for Python APIs talking to Kinesis streams.
03:35 How about that?
03:35 That's great.
03:36 Yeah.
03:36 So if you're out there and you've got tons of data streaming in, and especially if you're already an AWS customer, you already have an account, you already work there, maybe your apps run there, then it's really cool.
03:45 So this library does some cool stuff.
03:47 It works with Python 2.7 and 3.6.
03:50 It has a Django extension helper.
03:52 It automatically detects shard changes.
03:55 So, like, this thing can do sharding.
03:57 It'll, like, adjust for that.
03:58 It'll create checkpoints.
04:00 And even has a dummy Kinesis implementation for testing.
04:04 How about that?
04:04 That's great.
04:05 That's cool.
04:05 Yeah.
04:06 And this is an open source project, too.
04:08 So you can extend on it if you need to.
04:11 Right on.
04:11 Yeah, it's pretty new.
04:12 But check it out, and thanks for recommending pynesis.
04:15 I forgot the guy who sent it in over Twitter, but, yeah, thank you.
04:18 It's awesome.
04:18 So one of the more mysterious things, I think, in Python, relative to, say, other languages, like C, for example, is how memory works, right?
04:27 Like, in C, I call malloc or I call free.
04:30 In Python, I just do stuff and, like, I never run out of memory.
04:34 That's kind of cool.
04:35 Yeah, it is cool.
04:36 But it has some downsides, a little bit, I guess.
04:39 Not really.
04:40 At least some complexity, right?
04:41 Yeah.
04:42 Well, and it hides that complexity from the users.
04:44 But there's, especially when you have an application or a service or something that's a long-running Python application, you kind of have to care about what's going on and make sure that you don't continually grow in memory.
04:57 There's an article that we're going to link to called Things You Need to Know About Garbage Collection in Python.
05:03 And it just came out recently.
05:04 And I sat down with a cup of coffee this morning and really read it and tried to grok it.
05:09 And I think it helped me a lot to understand how Python does it.
05:12 There's two levels of garbage collection.
05:14 There's the automatic stuff that's just, if an object goes out of scope, it disappears.
05:21 And then Python can reclaim that memory.
05:24 And there's something about, like, it treats small objects, like under 512 bytes, a little different to save time.
05:31 And that's cool.
05:32 But then there's this other thing to detect loops and other dead memory.
05:37 Because reference counting, you can have objects point to each other and you can get these loops of memory that just sit around forever.
05:45 And so there's this other system, the generational garbage collector that goes through and looks for all of these dead items and cleans them out.
05:54 And that runs periodically.
05:55 But that one you can control if you need to.
05:59 If you really can't handle it going off and doing its own thing, you can turn it off and call it yourself once in a while if you need to.
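The manual control described above can be sketched with the standard-library gc module. This is a minimal sketch, not the code from the article: the helper name run_batch and its workload are hypothetical, and the only real API calls used are gc.isenabled, gc.disable, gc.collect, gc.enable, and gc.get_threshold.

```python
import gc


def run_batch(items):
    """Hypothetical workload: process items with the cycle collector paused."""
    was_enabled = gc.isenabled()
    gc.disable()  # pause the automatic generational (cycle) collector
    try:
        # Reference counting still frees non-cyclic objects immediately here.
        return [item * 2 for item in items]
    finally:
        unreachable = gc.collect()  # run a full collection at a time we choose
        print("unreachable objects found:", unreachable)
        if was_enabled:
            gc.enable()  # restore automatic collection


result = run_batch([1, 2, 3])
print(result)
print("thresholds:", gc.get_threshold())
```

Disabling the collector only stops the periodic cycle detection; objects whose reference count drops to zero are still reclaimed right away.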
06:06 What's really interesting about this is one of the benefits of, like, C or C++ really is you get total deterministic behavior.
06:15 But the drawback is you've got to manage it manually.
06:18 With reference counting, you get also totally deterministic behavior, right?
06:23 You run it many times, it's going to behave the same way exactly.
06:26 So if you're doing something where timing really matters, that's cool.
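The deterministic behavior of reference counting can be observed directly in CPython. The sketch below assumes CPython (other implementations may free objects later), and the Resource class is a hypothetical example, not something from the article.

```python
import sys


class Resource:
    """Hypothetical object that records when it is destroyed."""

    def __init__(self, log):
        self.log = log

    def __del__(self):
        self.log.append("freed")


log = []
r = Resource(log)
# getrefcount includes a temporary reference held by the call itself.
print("refcount:", sys.getrefcount(r))
del r  # last reference gone: CPython runs __del__ right here
print(log)
```

Because the object is destroyed at the exact point its last reference disappears, the cleanup happens at the same place in the program on every run.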
06:28 The reference counting GCs or reference counting algorithm has the problem of cycles.
06:34 So if I have, like, a parent-child relationship, they're always going to have at least one reference because parent knows a child, child knows a parent.
06:41 So that thing's never going to go to zero and will leak.
06:43 So you have this secondary, like, mark-and-sweep garbage collector type thing that comes in.
06:46 And I think it's really interesting how they've chosen, like, this combination.
06:51 And the mark-and-sweep garbage collector is similar to, like, .NET or Java, which is that's all they have over there, right?
06:57 I didn't know.
06:58 Yeah, yeah.
06:58 Those two basically work in this generational garbage collector way, very similar.
07:02 I don't know that it's exactly the same, but it's similar for Java and .NET.
07:05 But that's not the main way it works.
07:08 So I think that that's actually pretty interesting.
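The parent-child cycle described above can be sketched as follows. The Node class is hypothetical, and a weak reference is used only as a probe to see whether the object is still alive; the automatic collector is paused so that the explicit gc.collect() call is the only thing that can free the cycle.

```python
import gc
import weakref


class Node:
    """Hypothetical parent/child node used only for this illustration."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


gc.disable()  # keep the automatic collector from running mid-demo

parent = Node("parent")
child = Node("child")
parent.children.append(child)  # parent -> child
child.parent = parent          # child -> parent: a reference cycle

probe = weakref.ref(parent)    # lets us observe whether the object is alive
del parent, child              # drop our names; the cycle keeps both alive

alive_before = probe() is not None
gc.collect()                   # the cycle detector finds and frees the pair
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)
```

Reference counting alone never frees the pair, because each object still holds a reference to the other; only the cycle collector reclaims them.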
07:10 I mean, the article here doesn't go into too much depth, but deep enough to where you can understand it.
07:15 And it's really, I thought, you know, I knew that you could mess around with stopping the garbage collector and – or the generational one.
07:23 And controlling that yourself.
07:26 But I didn't know how to do it.
07:27 And it's really not that complicated.
07:28 It's a few lines of code is all.
07:29 Yeah.
07:30 There's a couple of neat things about this article.
07:32 One is there are some very nice specifics, like, did you know the objects that are equal to or smaller than 512 bytes have a different allocator and mechanism, right?
07:42 Like, knowing that cutoff and those sorts of things and knowing when the GC kicks in and when to turn it off.
07:49 Like, there's also a lot of references.
07:50 Like, if you don't know more about this, read about this section.
07:52 If you don't know more about this, read about this section.
07:54 So I think this is a great place to start this exploration.
07:57 And then at the end, it talks about how to find these – you know, these cycles are bad.
08:01 And you kind of want to get those out of your code if you really want to care about this a lot.
08:05 And it talks about how to do – how to go about looking for that stuff and visualizing it so you can try to find these cycles in your code and get rid of them.
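One standard-library way to look for cycles, offered here as a sketch rather than the article's own method, is the gc.DEBUG_SAVEALL flag: it makes the collector keep every unreachable object in gc.garbage instead of freeing it, so the objects can be inspected. The Node class is hypothetical, and drawing the reference graph would need a separate tool that this sketch does not cover.

```python
import gc


class Node:
    """Hypothetical class that refers to itself, forming a cycle."""

    def __init__(self):
        self.me = self


gc.collect()                     # start from a clean slate
gc.set_debug(gc.DEBUG_SAVEALL)   # keep unreachable objects in gc.garbage

Node()                           # created and immediately dropped: a leaked cycle
gc.collect()

leaked = [obj for obj in gc.garbage if isinstance(obj, Node)]
print("cyclic Node objects found:", len(leaked))

gc.set_debug(0)                  # restore normal behaviour
gc.garbage.clear()
```

Anything that shows up in gc.garbage this way was kept alive only by a cycle, which points at the classes worth restructuring.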
08:13 That's cool.
08:14 Yeah, and the other thing to consider when you're thinking about stuff, especially if it kicks into the actual mark and sweep cycle, garbage collector type thing, is algorithms and data structures.
08:25 So you can have a data structure that is, like, many, many objects that point at each other.
08:30 Think of, like, linked list type of things.
08:31 There's tons of work to process those if you've got ginormous ones.
08:36 It's tons of work to process and determine if that's garbage, right?
08:40 You might be able to, like, use a sparse array or something that uses almost no pointers but stores the same data and, you know, is more efficient.
08:48 So there's a lot of interesting follow-on things to explore here.
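To illustrate the data structure point, the sketch below stores the same values twice: once as a linked list of hypothetical Link objects, each of which the collector has to visit, and once in a single array.array buffer. The array module is a stand-in chosen for this example; the episode mentions a sparse array, which is not what is shown here.

```python
import gc
from array import array


class Link:
    """Hypothetical singly linked list node."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def walk(node):
    """Yield the values of a linked list from head to tail."""
    while node is not None:
        yield node.value
        node = node.next


values = [float(v) for v in range(1000)]

# One collector-tracked object per element, each pointing at the next.
head = None
for v in reversed(values):
    head = Link(v, head)

# One object in total: raw doubles packed in a buffer, no per-element pointers.
packed = array("d", values)

print(gc.is_tracked(head))
print(list(walk(head)) == list(packed))
```

Both structures hold the same data, but the linked version gives the cycle collector a thousand container objects to examine while the packed version gives it one.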
08:51 And again, yeah, and this is mostly a concern with people that have long-running Python applications.
08:56 For short-running things, it's not a problem.
09:00 So you don't really have to care about it.
09:01 Yep.
09:01 Also, another final thought is you said you can turn off the garbage collector.
09:05 I think, was it Instagram that turned off the garbage collector?
09:08 In their system, it was either, I feel like it was Instagram or Quora, one of those people, one of those companies turned off the garbage collector.
09:16 And they found they were able to get much better memory reuse on Linux across the processes and actually was better off by just letting the cycles leak.
09:25 In this article, you can determine it yourself.
09:27 You can have predetermined times where you're going to go out and let it run.
09:31 Yep.
09:31 Pretty interesting.
09:32 You know what else is interesting?
09:33 Spaces.
09:35 Yeah, it is.
09:36 Spaces is pretty awesome.
09:37 Yeah.
09:37 So like this audio you guys all are listening to came over DigitalOcean Spaces.
09:41 And if you're familiar with S3, this is like S3 but way better.
09:44 So very deterministic pricing.
09:47 You pay $5 a month for a terabyte of outbound traffic, no inbound traffic.
09:52 And beyond that, it's like one-ninth the price of bandwidth and traffic for S3.
09:59 So if you're using S3 now, definitely consider DigitalOcean Spaces.
10:02 They're doing really cool stuff there.
10:04 All the APIs, the libraries, and the tools that work against S3 also work against Spaces.
10:10 They've made that sort of a compatibility layer for them.
10:13 So I've been using it.
10:14 I really, really like it.
10:15 And I definitely encourage you to check it out at do.co/python.
10:19 It helps support the show.
10:20 And like I said, I think it's pretty awesome.
10:23 So let's talk about the web for a little bit.
10:25 You know, many times we've touched on asynchronous programming of one variety or another: threads, multiprocessing, asyncio type of things.
10:34 But the truth is that on the web, almost all of the things, all the frameworks are built in a way that cannot take advantage of that at all.
10:45 Or very, very rarely, I guess.
10:47 Because they're built upon WSGI, the Web Server Gateway Interface.
10:51 And that basically has a single serial function call for each request.
10:56 And that's that.
10:57 There's really not much of a way to expand or to change how the web processing works.
11:04 So if you want to do maybe some async and await on database calls or against web services, like you could do with requests, for example,
11:12 that's basically not going to have any effect. It's still going to be blocking somewhere along in this WSGI request.
11:19 There's no way for the server to take advantage of that.
11:22 Some of the servers use threads, like uWSGI, but still, it's not nearly the same level of benefit.
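The "single serial function call for each request" shape of WSGI can be sketched with the standard library alone. The application below is a hypothetical minimal app, and it is called directly with a fake request built by wsgiref.util.setup_testing_defaults rather than through a real server.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A WSGI app: one synchronous call per request, returning the body."""
    body = b"Hello from WSGI\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    # Any slow work done before this return blocks the worker handling the request.
    return [body]


# Call the app directly with a fake request, the way a server would.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, headers):
    captured["status"] = status
    captured["headers"] = headers


response = b"".join(application(environ, start_response))
print(captured["status"], response)
```

The interface is an ordinary function call that must return before the response is complete, which is why there is no place for the server to suspend a request while it waits on I/O.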
11:27 So there's this article I want to mention, or series, I guess, that's starting to come out here called WSGI is not enough anymore.
11:35 I'm referencing part one and part two, and part one really lays out the problem.
11:38 Basically, there are two problems.
11:40 One is concurrency, right, which I just described.
11:44 The other problem really is that HTTP isn't the only protocol anymore.
11:48 So things like WebSockets and other bidirectional communication and binary stuff are happening.
11:55 That's also not supported by WSGI, right?
11:58 So this article and series sort of explores how do we solve this with event-driven programming, and they're not quite done.
12:06 They're still working on it, but I thought it was a cool thing.
12:09 So the next session, the next thing that's coming out is talking about libraries to solve the concurrency problem in Python and then onward to the other things.
12:16 So pretty cool.
12:17 Yeah, that's very interesting.
12:19 I can't wait for the day when these things really unlock because we talk about things like async and await, and they're pretty cool, but they're really hard to make practical use of.
12:28 Once the web server requests themselves can participate in these async event loops, then it's on, right?
12:35 It just breaks open, and all sorts of amazing stuff can happen.
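What a request handler could gain from taking part in an event loop can be sketched with asyncio. The fake_io coroutine is a hypothetical stand-in for a database or web service call, and handle_request is not tied to any real web framework.

```python
import asyncio
import time


async def fake_io(name, delay):
    """Hypothetical stand-in for a database or web service call."""
    await asyncio.sleep(delay)  # yields to the event loop while "waiting"
    return name


async def handle_request():
    # Both waits overlap, so the total is close to the longest one, not the sum.
    return await asyncio.gather(fake_io("db", 0.2), fake_io("api", 0.2))


start = time.perf_counter()
results = asyncio.run(handle_request())
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

While one coroutine is waiting, the loop is free to run another, which is the benefit a blocking WSGI call cannot offer.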
12:38 So I guess I didn't realize that these frameworks couldn't take advantage of web sockets, or can they with add-on libraries or something?
12:47 Yeah, you've got to set up some kind of separate server, and I can't remember what it's called, unfortunately.
12:53 But they can send it over, say, you know, we're going to upgrade this to a socket, so send it over to this separate process, like the separate server type of thing.
13:01 There's a lot of work to juggle these different protocols right now.
13:04 So yeah, it'll be nice when that's more seamless.
13:07 Well, I'll have to follow along with these.
13:08 This is great.
13:09 Yeah.
13:09 And for now, we can use things like queues even for a little asynchronous concurrency, drop off a little job and pick it back up.
13:15 I was looking for a queue, a last-in, first-out queue.
13:20 I needed that for a project I was working on.
13:22 I just needed it as a data structure.
13:24 I didn't have different producers and consumers.
13:27 I just had one part of the program where I was collecting stuff and another part where I had to get it out last-in, first-out.
13:34 So I was looking around, and there was an article from Dan Bader, and it's called Queues in Python.
13:41 And it's a decent – I guess I'd just forgotten about a lot of this stuff.
13:45 And it kind of goes over how to use queues in Python: how to use a list, how to use the queue library.
13:54 There's actually a built-in queue library.
13:56 And collections.deque is also something you can use.
14:01 The deque is a doubly-linked list.
14:03 And then it talks about pretty much how to use them.
14:06 And it's a pretty good article.
14:07 And it mentions that you can use all of these for last-in, first-out, but I didn't quite know how to use those.
14:13 So I went ahead and explored all the different ways to use these three as – or just a way to use these three as a last-in, first-out queue and threw it in the show notes.
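The three options mentioned above can each serve as a last-in, first-out queue. The following is an illustrative sketch with made-up sample items, not the code from the show notes.

```python
from collections import deque
from queue import LifoQueue

items = ["a", "b", "c"]

# 1. Plain list: append to push, pop() to take the most recent item.
stack = []
for item in items:
    stack.append(item)
from_list = [stack.pop() for _ in range(len(stack))]

# 2. queue.LifoQueue: put/get, with locking meant for producer/consumer threads.
lifo = LifoQueue()
for item in items:
    lifo.put(item)
from_lifo = []
while not lifo.empty():
    from_lifo.append(lifo.get())

# 3. collections.deque: append/pop from the right end.
dq = deque()
for item in items:
    dq.append(item)
from_deque = [dq.pop() for _ in range(len(dq))]

print(from_list, from_lifo, from_deque)
```

All three return the items in reverse insertion order; LifoQueue adds thread-safe locking, which is unnecessary overhead when there is only a single producer and consumer in one thread.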
14:25 Yeah, it's really cool, really simple.
14:27 I think knowing about data structures and especially knowing about the built-in ones is really valuable.
14:32 And I feel like we've been doing Python for a long time, but I still continuously learn about these things.
14:37 It's good to come back when you start using the data structures you're just using all the time and you need something else.
14:43 Going ahead and looking what's around is neat.
14:45 I was also curious about timing.
14:48 So I went ahead on a sample program and timed all these to see with, like, some huge objects I was throwing in there to see if any of them were faster or slower.
14:58 And with small objects, they're all kind of about the same.
15:03 And with large objects, it looks like collections.deque is a tad bit faster for my use.
15:09 But none of them are really out of the ballpark slower.
15:12 So to me, the deque has just the best interface because you can just iterate over it, and it looks cleaner.
15:19 But that was my opinion.
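A timing comparison along these lines can be sketched with timeit. The workload size N and the three helper functions are hypothetical, the numbers will vary by machine and Python version, and this is not the sample program used in the episode.

```python
import timeit
from collections import deque
from queue import LifoQueue

N = 10_000  # hypothetical workload size


def use_list():
    s = []
    for i in range(N):
        s.append(i)
    while s:
        s.pop()


def use_deque():
    d = deque()
    for i in range(N):
        d.append(i)
    while d:
        d.pop()


def use_lifo_queue():
    q = LifoQueue()
    for i in range(N):
        q.put(i)
    while not q.empty():
        q.get()


# Take the best of three repeats, each running the function five times.
timings = {
    fn.__name__: min(timeit.repeat(fn, number=5, repeat=3))
    for fn in (use_list, use_deque, use_lifo_queue)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```

Taking the minimum of several repeats reduces noise from other activity on the machine, which matters when the candidates are close to each other.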
15:21 Yeah, that's really cool.
15:22 Thanks for pointing that out.
15:23 All right.
15:25 I want to sort of close this out with something kind of meta.
15:28 So on our podcast, I want to talk about a new podcast.
15:32 So a guy named Mark Weiss created a podcast called Using Reflection, a podcast about humans and engineering.
15:40 So he started out interviewing Jesse Davis from MongoDB, one of the main Python guys in the space.
15:47 So there's a really cool interview with him.
15:50 And if you're thinking about – you want to look at these notable people on how they've become leaders within their companies or within their industry,
15:57 and you want to sort of explore that journey with them, it's a pretty cool podcast.
16:01 So I thought I'd give a shout-out to it.
16:02 I listened to a couple episodes, and I like his interview style, and it's very conversational and laid back.
16:09 It's cool.
16:09 Yeah, it's like you just kick back, grab a coffee with the two guys, and you just don't say anything because they can't hear you.
16:14 Yeah.
16:14 You can say stuff, but they still don't hear you.
16:16 Yeah.
16:18 Yeah, who knows?
16:19 That's awesome.
16:20 All right.
16:20 So, yeah, check out Using Reflection.
16:22 It's a cool podcast.
16:23 All right.
16:24 So I guess that's it for our news this week, Brian.
16:25 Anything else you got you want to share with the people?
16:27 I got nothing this week.
16:29 No more book writing, just hanging out at the zoo now, huh?
16:31 That was fun.
16:32 If, you know, your idea of fun is trying to herd six eight-year-olds around a zoo for a day, then it was fun.
16:38 Oh, give me a tricky bug.
16:39 I'll take that instead.
16:42 Yeah, so, you know, last week I talked about, I announced my free MongoDB course at freemongodbcourse.com, and that thing has been going super well.
16:50 Like, over 5,000 people have taken that course in a week.
16:53 That's pretty amazing.
16:54 I have to admit that I was doing your longer Mongo course, and I thought I'd watch this first.
17:01 So I've started it myself.
17:03 I'm one of those signups.
17:04 Oh, cool.
17:05 You're like, I don't know what that person is.
17:08 Cool.
17:10 Very nice.
17:11 Yeah, people seem to be enjoying it.
17:12 So I'm glad everyone could take advantage of it.
17:14 I'm glad you put that out there.
17:15 It's really cool, and people should check it out.
17:17 Yeah, thanks.
17:18 All right, well, I guess until next week, Brian.
17:21 Yeah, talk to you next week.
17:22 All right, talk to you next week.
17:23 Thank you for listening to Python Bytes.
17:26 Follow the show on Twitter via @pythonbytes.
17:29 That's Python Bytes as in B-Y-T-E-S.
17:32 And get the full show notes at pythonbytes.fm.
17:35 If you have a news item you want featured, just visit pythonbytes.fm and send it our way.
17:40 We're always on the lookout for sharing something cool.
17:42 On behalf of myself and Brian Okken, this is Michael Kennedy.
17:46 Thank you for listening and sharing this podcast with your friends and colleagues.