Transcript #235: Flask 2.0 Articles and Reactions
00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
00:05 This is episode 235, recorded May 26, 2021.
00:10 And I'm Brian Okken.
00:11 I'm Michael Kennedy.
00:12 And I'm Vincent Warmerdam.
00:14 We talked about Vincent a while ago and got his name wrong.
00:17 And he told us a story that was good.
00:19 That we accidentally pronounced his name, what, Wonderman.
00:25 Yes.
00:27 So sorry about that.
00:28 That's fine. It's fine. I was bragging to my wife that I was on the podcast, and then I was announced as Vincent Wonderman. She's still kind of philosophical about the whole thing, but it was a fun introduction. It's the best mispronunciation of my life, let me put it that way. It's your alter ego. It's like your spy name. I'll take it. Well, thanks for joining us today. My pleasure. So should we jump into the first topic? Sure. Okay. Well, I think we mentioned last time that Flask 2.0 was out, and then Michael, you talked with somebody, didn't you? I did. I had David Lord and also
01:09 Philip Jones on Talk Python to basically announce Flask 2.0 and talk about all its features. Yeah, and that was a great episode. What I wanted to cover was a couple of things, an article and a video. So first off, we've got a link to the change list.
01:31 Actually, I lost the change list.
01:34 Yeah, there it is. You can read through that, and maybe that's exciting to you, but I like a couple of other ways.
01:41 There's an article by Patrick Kennedy, Async in Flask 2.0.
01:48 I really like this article.
01:50 It goes through describing what it means to have async in Flask and how it works with some nice little diagrams.
01:58 >> Diagrams are always nice.
02:01 >> Yeah.
02:01 >> Yeah.
02:02 >> Yes.
02:02 >> Pictures.
02:03 >> Yes.
02:04 >> Then description of the ASGI and why we don't need it yet.
02:11 I'm not sure what the timeline is for Flask, if they're going to do more of it, but there is a discussion of the fact that it's not completely async yet.
02:22 - There was a lot of discussion with David and Philip that they may be leaving Quart to take the place of full-on ASGI Flask.
02:33 And the idea being that there's a lot of stuff that kind of has to change, especially around the extensions, and you get nearly that, but not exactly that, by using the gevent async stuff that's in regular Flask, and that integrates in if you just do an async def method in your regular Flask.
02:50 But if you want true asyncio integration, then they basically were saying that, for the foreseeable future, instead of import Flask and going with that, just import Quart, and wherever you see Flask, replace it with the word Quart.
03:03 - Okay.
03:04 But there's other cool stuff other than the async that's coming into Flask 2.0.
03:10 So I appreciate it.
03:11 There's also a video from, we don't want it to play, from Miguel Grinberg and talking about some of the new stuff in Flask.
03:23 And I really like this.
03:25 One of the things that he covers right away is the new route decorators.
03:30 And-- - Yeah, those are nice.
03:31 - Might be just a syntax thing, but it's really nice.
03:34 So you used to have to say app.route and then methods equals POST, or list the methods.
03:40 And now you can just say app.post.
03:42 That's nice.
03:43 Then a really clean discussion of the WebSocket support with Flask.
03:49 Then he goes in to talk about the async.
03:52 With that he also does a little demo, timing it.
03:55 I was actually surprised at how easy it was to set up this demo of timing and showing that.
04:05 He showed that you can increase the number of users and it doesn't really increase your response time, and the number of requests per second doesn't increase, because of the way that Flask 2.0 is done.
04:21 But it was nice.
04:22 And then he also talked about some of the extensions that he wrote that work with Flask 2.0 and stuff.
04:29 So it was definitely worth the listen.
04:31 - Oh, that's always cool.
04:32 That's always the thing when you get like, Flask is like a pretty big project.
04:35 So when there's like a new upgrade of that, one of the things that people sometimes forget is like, "Oh, well, all the plugins, do they kind of still work?" So it's nice if someone does a little bit of the homework there, and says, "Well, here's a list of stuff that I've checked, and that's at least compatible." Well, he's mostly doing some...
04:49 So, for instance, one of the things is around which...
04:53 I don't know, which...
04:54 No, just some of the WebSocket stuff has changed, and some of the other things have changed.
04:58 And he had some different shims that he was recommending before, but now you don't have to swap out some things.
05:08 So like for instance, some of the extensions that were allowing for WebSockets required you to swap out the server for a different server, and you don't have to anymore.
05:17 - Ah, like that, right. - So this is something that changes.
05:19 - Okay, cool. - Yeah, a couple big other things that come to mind.
05:23 One, they've dropped Python 2 support and even 3.5 and below.
05:27 I mean, we're at this point where 3.5 is like old school legacy, which surprises me.
05:32 - That still feels new. - Yeah, I remember when it came out.
05:34 - Yeah, yeah, well, that was when async and await arrived.
05:38 So that was the big deal there.
05:40 But it doesn't have f-strings, so it's--
05:42 - Yeah, that's the killer feature.
05:44 - Yeah, so there's that.
05:48 And they also said that you are not going to need to change your deployment infrastructure if you want to run async Flask.
05:54 You can just push a new version and it's good to go.
05:57 So yeah, a lot of neat things there.
05:59 Very good.
06:01 - Nice.
06:02 What do we got next, Michael?
06:04 Well, what if Python were faster?
06:07 That would be nice.
06:08 - That's always good.
06:08 - We actually talked about Cinder.
06:10 Remember Cinder?
06:11 - Yeah.
06:12 - From the Facebook world.
06:14 So that's one really interesting thing that is happening around Python.
06:19 And there's a lot of cool stuff here, but remember, this is not supported.
06:22 It's not meant to be a new runtime.
06:26 Just there to give ideas and motivation and examples and basically to run Instagram.
06:31 On the other hand, Mike Driscoll tweeted out, "Hey, Python might get a two-times speedup in the next version of Python." And you might wanna check out Guido's slides from the Python Language Summit at the virtual PyCon.
06:44 That's exciting, right?
06:45 - Yes.
06:46 I mean, if Guido is saying it, then odds of it happening increase, right?
06:52 - Exactly, exactly.
06:54 So a while ago, we actually covered what has now become known as the Shannon plan for making Python faster, a little bit each time over five years, or I guess over at least the next four years at that point, and how to make that happen.
07:08 So some of these ideas come from there.
07:10 And so here I'm pulling up the slides and it says, can we make CPython faster?
07:16 If so, by how much?
07:17 Could it be a factor of two?
07:19 Could it be a factor of 10?
07:20 And do we break people if we do things like this?
07:22 So the Shannon plan, which was posted last October and we covered, talks about how do we make it 1.5 times faster each year, but do that four times.
07:32 And because of compounding performance, I guess, that's five times faster.
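The compounding arithmetic can be checked in a couple of lines. The variable names below are illustrative only.

```python
yearly_speedup = 1.5   # target speedup per release in the plan
releases = 4           # number of releases the plan covers

# Speedups multiply rather than add: 1.5 * 1.5 * 1.5 * 1.5.
total_speedup = yearly_speedup ** releases
print(total_speedup)  # 5.0625
```

So four rounds of 1.5x come out just over the "five times faster" figure quoted.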
07:37 All right, so there's that.
07:39 Guido said, "Thank you to the pandemic.
07:41 Thank you to boredom.
07:42 I decided to apply at Microsoft." And shocker, they hired him.
07:45 So as part of that, it's kind of just like, hey, we think you're awesome.
07:51 Why don't you just pick something to work on that will contribute back?
07:54 That'd be really cool.
07:55 So his project at Microsoft is around making Python faster, which I think is great.
08:00 - Cool.
08:01 - So yeah, so there's a team of folks, Mark Shannon, Eric Snow, and Guido, and possibly others, who are working with the core devs at Microsoft to make it faster.
08:12 It's really cool.
08:13 Everything will be done on the public GitHub repo.
08:16 There's not like a secret branch that will be then dropped on it.
08:19 So it's all just gonna be PRs to github.com/python/cpython, whatever the URL is, the public spot.
08:27 And one of the main things they wanna do is not break compatibility.
08:31 So that's important.
08:33 Also said, what things could we change?
08:37 Well, you can't change the base object, like Py, what is it, PyObj, basically the base class, right?
08:44 PyObject pointer, that's it, the PyObject class.
08:46 So that thing has to stay the same, and it really needs to keep reference counting semantics, 'cause so much is built on that.
08:52 But they could change the bytecode that exists, the stack frame layout, the compiler, the interpreter, maybe make it a JIT compiler to JIT compile the bytecode, all of those types of things.
09:03 So that's pretty cool.
09:05 They said, how are we gonna reach a two-times speedup in 3.11? An adaptive, specialized bytecode interpreter that will be more performant around certain operations, an optimized frame stack, faster calls, zero-overhead exception handling, and things like integer internals, so maybe treating numbers differently, and changing up the PYC files.
09:27 So there's a lot of stuff going on.
09:28 Also, putting the dunder dict (__dict__) for a class always at a certain known location, 'cause anytime you access a field, you have to go to the dunder dict, get the value out, and then read it.
09:40 And I suspect the first thing that happens is, well, go find the dunder dict pointer and then go get the element out of it.
09:47 So if every access could just go, nope, it's always one certain byte off in memory from where the class starts.
09:54 That would save that sort of traversal there.
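A small sketch of the Python-level view of this: instance attributes live in a per-instance dictionary. The Point class is made up for illustration, and the fixed memory offset discussed here is a C-level detail of the interpreter that this snippet does not show.

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(3, 4)

# Instance attributes are stored in the per-instance dictionary.
print(p.__dict__)  # {'x': 3, 'y': 4}

# Reading p.x for a plain instance attribute ends up as a lookup in that
# dict, so a direct write to the dict is visible through attribute access.
p.__dict__["x"] = 30
print(p.x)  # 30
```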
09:57 So some pretty neat things.
09:59 - Yeah, I'm glad you explained that, 'cause I read it before and I'm like, why would that help at all?
10:03 - I think you can traverse one fewer pointer.
10:07 In general it doesn't matter, but for literally everything you ever touch, ever, if you could cut in half the number of pointers you gotta follow, that'd be good.
10:15 - Yeah, this is always one of those things that always struck me with.
10:19 When you're using Python, you don't think about these sorts of things.
10:21 It's when you're doing something in Rust or something that you are confronted with the fact that you really have to keep track of where the pointer is pointing in memory and all that.
10:28 You take a lot of stuff for granted, so it's great that people are still sort of going at it and looking for things to improve there.
10:33 - Yeah, absolutely.
10:35 You know, in C, you do the arrow, you know, dash greater than sort of thing.
10:39 Every pointer, so you're like, "I'm following a pointer, I'm following a pointer." Like, you know it, right?
10:42 Here, you just write nice, clean code and magic happens.
10:46 (laughing)
10:47 So let me round this out with who will benefit.
10:49 So who will benefit if you're running CPU intensive pure Python code, that will get faster because the Python execution should be faster.
10:57 Websites should be faster because a lot of that code is running in the Python space.
11:02 And whatever else happens to use Python.
11:04 Who will not benefit so much?
11:06 NumPy, TensorFlow, Pandas, all the code that's written in C, things that are IO bound.
11:11 So if you're waiting on something else, speeding up the part that goes to wait, doesn't really matter.
11:16 Multi-threaded code, because of the GIL at this point.
11:19 But Eric Snow is also working on the sub-interpreters which may fix that and so on.
11:23 So anyway, pretty neat stuff.
11:26 There's some PEPs out there.
11:28 I'll link to the tweet by Mike Driscoll but that'll take you straight to the GitHub repo which has the PDF of the slides and people can check that out if they're interested.
11:39 - I like the last bullet on the previous slide, which says the people that will not benefit: code that's algorithmically inefficient.
11:46 In other words, if your code already sucks, it's not going to be better.
11:49 - It may be better, but it could be better.
11:53 - I was about to say, theoretically, it actually would go faster and just...
11:56 - Just not as much better as it could, right?
11:59 - Yeah, it would still be like n to the power of three or something like that, but it would be faster n to the power of three.
12:05 - Yeah, yeah, it won't change the big O notation, but it might make it run quicker on wall time.
12:10 That's right. - Yeah.
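A toy cost model of that point: a faster interpreter shrinks the constant factor but leaves the growth rate alone. The per-step costs and function names are hypothetical, chosen only to make the arithmetic exact.

```python
def cubic_steps(n):
    """Number of basic steps for a made-up O(n^3) algorithm."""
    return n ** 3


def modeled_runtime_ns(n, ns_per_step):
    """Toy cost model: total time is steps times the cost of one step."""
    return cubic_steps(n) * ns_per_step


SLOW_INTERPRETER_NS = 2  # hypothetical cost per step today
FAST_INTERPRETER_NS = 1  # hypothetical cost per step after a 2x speedup

# A 2x faster interpreter halves the runtime at any fixed input size...
speedup = modeled_runtime_ns(100, SLOW_INTERPRETER_NS) / modeled_runtime_ns(100, FAST_INTERPRETER_NS)
print(speedup)  # 2.0

# ...but doubling the input still costs 8x, on either interpreter.
growth = modeled_runtime_ns(200, FAST_INTERPRETER_NS) / modeled_runtime_ns(100, FAST_INTERPRETER_NS)
print(growth)  # 8.0
```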
12:11 - Yeah, and Christopher Tyler out there in the livestream says, "I know I still need to improve my code, but this would be great," right?
12:18 I mean, it used to be that we could just wait six months, a new CPU would come out that's like twice as fast as what we ran on before.
12:25 Like, oh, now it's fast enough, we're good.
12:27 That doesn't happen as much these days.
12:28 So it's cool that the runtimes are getting faster.
12:30 - Yeah, and I mean, let's be honest, Python is also still used for just lots of script tasks, like, hey, I just need this thing on the command line that does the thing, and I put that in cron.
12:39 And like a lot of that will be nice if that just gets a little bit faster.
12:42 And it sounds like this will just be right up that alley.
12:45 Yeah.
12:45 And one of the things that I know has been holding certain types of changes back has been concern about slowing down the startup time, because if all you want to do is run Python to make a very small thing happen, but there's a big JIT overhead and all sorts of stuff, and it takes two seconds to start and a microsecond to run, right, they don't want to put in those kinds of limitations and kill that use case either.
13:09 So yeah, it's, it's good to point that out.
13:11 All right, Vincent, you're up next.
13:12 Cool. Yeah, so I dabble a little bit in fairness algorithms.
13:17 It's a big, important thing.
13:19 So I get a lot of questions from people like, "Hey, if I want to do machine learning and fairness, where should I start?" And I don't think you should start with algorithms.
13:26 Instead, what you should do is you should go check out this Python project called Deon.
13:30 And the project's really minimal. The main thing that it really just does is it gives you a checklist of just stuff to check before you do a big data science project at a big company or an enterprise or something like that.
13:41 And they're really sensible things.
13:44 They're sort of grouped together.
13:46 So like, hey, can I check off that I have informed consent and collection bias?
13:51 Can I check all of these things off?
13:52 The main themes are--
13:53 - And it's literally a checkbox.
13:54 You can check them off in the page to sort of get the feel of it.
13:57 Like, oh yeah, these are good.
13:58 - It goes further.
13:59 So the thing is, this is an actual Python project.
14:01 You can generate this as YAML for your GitHub profile.
14:03 So like for your GitHub project, you actually have this checklist that has to be checked in Git so you know that people signed off on it.
14:09 Like you can actually see the checklist.
14:11 you can even maybe in your Git log see who checked it off.
14:14 But what's really cool is two things.
14:16 Like one, you can generate this checklist.
14:18 Two, you can also customize the checklist.
14:20 So if you are at a specific company with certain legal requirements, this tool actually kind of makes it easy to customize this very specific checklist for data projects.
14:27 But the real killer feature, if you ask me, like again, all of these comments are good.
14:31 Like, is the data security well done?
14:35 Is the analysis reproducible?
14:37 How do we do deployment?
14:38 Like all of these things that are usually like things that go wrong and were obvious in hindsight.
14:42 But the real killer feature is, usually you have to convince people to take this seriously.
14:46 So what the website offers is like an example list.
14:49 So for every single item that is on this checklist, they have one or two examples, typically these are like newspaper articles, of places where this has actually gone wrong in the past.
14:59 So if you need like a really good argument for your boss, like, "Hey, we got to take this seriously," there's a newspaper article you can just send along as well.
15:06 - That's interesting.
15:08 - Yeah, I like it.
15:10 - Yeah, and the fact you can also generate Jupyter Notebooks with this, you can customize it a little bit.
15:15 The people that made this, the company I think is called Driven Data.
15:18 They host Kaggle competitions for good causes.
15:20 That's sort of a thing that they do there.
15:22 But Deon is just a really cool project.
15:24 I think if more people would just start with a sensible checklist and work from there, a lot of projects would immediately be better for it.
15:32 - Yeah, this is really cool.
15:33 So things are, can you go to the very bottom of that page?
15:36 There you are.
15:37 - Yeah, sorry, just the checklist.
15:39 - Oh, right, yeah, yeah.
15:40 - So there's some examples like, make sure that you've accounted for unintended use.
15:45 Have you taken steps to identify and prevent unintended uses and abuse?
15:49 So like, if you created an app, find my friends in pictures.
15:53 So like, I wanna find pictures my friends have taken of me.
15:55 You could put it up and it would show you all the pictures your friends took, but maybe someone else is gonna use that to, I don't know, try to phish you.
16:02 Like, here's the picture of us together, or I don't know, some weird thing, right?
16:06 Or they use it for facial recognition and tracking when it had no such intent, right? Things like that.
16:11 It doesn't have this example, I think, but here's the best example of unintended use.
16:17 There used to be this geo lookup company where you could give an IP address and it would give you an actual address.
16:22 However, sometimes you don't know where the IP address actually is.
16:25 So it would just give the center point of a US state or the country.
16:28 So there used to be this house in the middle of Kansas, I think it was the center point, but the thing is, they would get FBI trucks driving by and doing raids and stuff because they thought there were criminals there.
16:41 Because the geo lookup servers would always say, "Ah, the crooks at that IP address, that's this latitude longitude place." Right, right.
16:47 We had a cyber attack.
16:48 It was from this IP address.
16:49 Yeah.
16:50 "Raid 'em, boys." And of course, it was just some poor farmer in the Midwest going, "You know, just the geographic center.
16:59 Please stop raiding my farm." Yeah, but the story was actually quite serious.
17:03 I think the person who lived there got death threats at some point as well because of the same mistake.
17:07 So this is stuff to take seriously.
17:09 The one thing that I did like is the solution.
17:11 I think now that instead of it pointing to the house in Kansas, I think it points to like the center of the three big lakes in Michigan.
17:21 I think there's just the middle of a puddle of water basically just to make it obvious to the FBI squads that like, no, it's not a person living there.
17:28 Yeah, but like, darn, these submarines are moved underwater, or whatever. But I mean, but that's why you want to have a checklist like this.
17:36 The thing with unintended use is that it's unintended.
17:39 So you cannot really imagine it, but you at least should do the exercise.
17:43 And that's what this list does in a very sensible way.
17:46 And more people should just do it.
17:47 And there's interesting examples, too.
17:49 You just have a look.
17:50 And there's also a little community.
17:52 There's a little community around it as well, like collecting these examples.
17:55 And they have a wiki page with examples that didn't make the front page cut.
17:58 So definitely recommend anyone interested in fairness, start here.
18:01 I was curious, you brushed by it fairly quickly, fairness analysis? Is that what you do?
18:09 I just don't know what that means.
18:13 Yeah, so, man, this is a long...
18:16 This topic deserves more time than I'll give it, but the idea is that you might be able...
18:21 We know that models aren't always fair, right?
18:24 It can be that you have models that, for example, well, Amazon was a nice example. They had a resume parsing algorithm that basically favored men because they had hired more men historically. So the algorithm would prefer men.
18:38 Oh, okay. That kind of fairness. Okay.
18:40 Historical, these have been our good employees. Let's find more like them.
18:44 Exactly. And the thing is, you get an algorithm that's unfair. So there are these machine learning techniques and there's this community of researchers that try to look for ways, like, "Can we improve the fairness of these systems?" So we don't just optimize for accuracy.
18:57 We also say, "Well, we want to make sure that subgroups are treated fairly and equally," and stuff like that. So I dabble a little bit in this.
19:03 There's this project I like to collaborate with.
19:05 I open-source a couple of things with these people.
19:08 It's called Fairlearn. The main thing that I really like about the package is that it starts by saying "Fairness of AI systems is more than just running a few lines of code." It starts by acknowledging that.
19:19 But they have mitigation techniques and algorithms and tools to help you measure the unfairness. It's scikit-learn compatible as well, stuff like that. Having said all that, start here. Start with a checklist. Don't worry about the machine learning stuff just yet. Start here. - Yeah, very cool. Before we move on, Connor Furster in the live chat says, "I'm glad the conversation of ethics and data science is enlarging. I think it's important about what we make." Yeah, I agree. Now, before we do move on, let me tell you all about our sponsor for this episode, Sentry.
19:51 So this episode is brought to you by Sentry.
19:53 Thank you, Sentry.
19:54 How would you like to remove a little bit of stress from your life?
19:57 Do you worry that users may be having difficulties or encountering errors with your app right now?
20:02 And would you even know it until they sent you that support email?
20:05 How much better would it be to have the errors and performance details immediately sent to you, including the call stack and values of local variables and the active user recorded right in the report?
20:14 With Sentry, it's not only possible, it's simple.
20:18 We actually use Sentry on our websites, it's on pythonbytes.fm, it's on Talk Python Training, all those things, and we've actually fixed a bug triggered by a user and had the upgrade ready to roll out as we got the support email.
20:30 They said, "Hey, I'm having a problem with the site, I can't do this or that." I said, "Actually, I already saw the error, I just pushed the fix to production, so just try it again." Imagine their surprise.
20:39 So surprise and delight your users. Create your Sentry account at pythonbytes.fm/sentry.
20:44 And when you sign up, there's a "Got a promo code? Redeem it."
20:47 Make sure you put Python Bytes in that section, or you won't get a few months of the free Sentry team plan and other features, and they won't know it came from us.
20:54 So use the promo code at pythonbytes.fm/sentry.
20:58 Yeah, thanks for supporting the show.
21:00 Brian.
21:01 - Yeah.
21:01 - I like this one that you picked here.
21:02 - You like this?
21:03 - I like it a lot.
21:04 It's very good.
21:05 It has pictures, little animated things and great looking tools.
21:09 - Yeah, so it was an article that was sent to us.
21:11 I can't remember who sent it, so apologies.
21:14 But it's an article called Three Tools to Track and Visualize the Execution of Your Python Code.
21:20 I don't know why. Executing your code just seems funny to me.
21:25 I know it just means run it, but chop its head off or something.
21:30 Anyway, the three tools it covers are, we don't cover this very much because I don't know how to pronounce it.
21:39 L-O-G-U-R-U, it's Loguru or Log Guru, not sure.
21:45 Loguru is a pretty printer with better exceptions.
21:50 Let's go and look at that.
21:52 It does exceptions like this, breaks out your exceptions into colors, and it's just a really great way to visualize it.
22:00 I would totally use this if I was teaching a class or something.
22:05 This might be a good way to teach people how to look at trace logs and error logs.
22:12 This is fantastic. And if you're out there listening and not seeing it, you should definitely pull up this site, because the pictures really are what you need to tell quickly.
22:19 Yeah.
22:20 That's one of the things I like about this article is that lots of great pictures.
22:25 One thing out of curiosity. So what I'm seeing here is that, for example, it says return number one divided by number two, and then you actually see the numbers that were in those variables.
22:33 Do you have to add a decorator or something to get this output?
22:36 Or how does that work?
22:39 – That's explained later, maybe. – I don't remember where...
22:42 – Yeah, it's explained later, I think. – Yeah.
22:45 I think you just pull it in and it just does it, but I'm not sure.
22:48 – Okay, interesting. – Anyway.
22:50 So that's Loguru.
22:53 Then there's Snoop, which is kind of fun.
22:57 That has...
22:59 Hold on to Snoop. Should have had this already.
23:02 Anyway, with Snoop, you can see it prints the lines of code being executed in a function.
23:08 So it just runs your code and then prints out each line in real time as it's going through it.
23:14 You would hardly ever want this, I think, but when you do want it, I think it might be kind of cool to watch it go along.
23:22 And you could also do this in a debugger, but if you didn't want to use a debugger,
23:27 You can do this on the command line.
23:28 One of the things with most debuggers that's a little challenging is you'll see the state, and you'll see the state change, and you'll see it change again.
23:37 But in your mind, you've got to remember, okay, that was a seven and then it was a five and then it was a three.
23:43 Oh, right. Yeah.
23:44 Right. And here it'll actually reproduce each line, each block of code with the values over.
23:50 If you are in a loop three times, it'll show like going through the loop three times with all the values set.
23:55 And that's pretty neat.
23:57 Yeah.
23:57 I would also argue just for teaching recursion, I think this visualization is kind of nice 'cause you actually see the indentation and the depth appear and so you can actually see this function is called inside of this other function and there's a timestamp.
24:09 So I would also argue this one's pretty good for teaching.
24:12 - I like it, in fact, Connor on the livestream says, "I'm teaching my first Python course tomorrow." So yeah, thanks for the timely article.
24:20 And a real-time follow-up for Loguru: you have to import logger, and then you gotta put a decorator on the function, and then it'll capture that super detailed output.
24:30 >> That's probably exactly what you want because you don't really want to do that for everything probably.
24:36 There'll be something you're working on that you want to trace.
24:40 Heartrate is the last tool that we want to talk about.
24:44 It's a way to visualize the execution of a Python program in real-time.
24:48 This is something we have not covered before, but it's, I thought there was a little video.
24:53 Yeah. It goes through and does a little like a heat map sort of thing on the side of your code.
25:02 So when it's running, you can kind of see that different things get hit more than others.
25:08 So that's almost like a profiler, sort of not speed, though.
25:13 It's just number of hits.
25:14 Yeah.
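To illustrate the hit-count idea, here is a standard-library stand-in built on sys.settrace. It is not heartrate's API or implementation, only a sketch of counting how often each line of one function runs; count_line_hits and total_of_evens are hypothetical names.

```python
import sys
from collections import Counter


def count_line_hits(func, *args):
    """Run func(*args) and count how often each of its source lines executes.

    Keys are line offsets relative to the function's `def` line.
    """
    hits = Counter()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # ignore every other function
        if event == "line":
            hits[frame.f_lineno - code.co_firstlineno] += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, hits


def total_of_evens(limit):
    total = 0
    for n in range(limit):
        if n % 2 == 0:
            total += n
    return total


result, hits = count_line_hits(total_of_evens, 4)
for offset in sorted(hits):
    print(f"line +{offset}: {hits[offset]} hit(s)")
```

The `if` line runs on every iteration while the `total += n` line runs only for even numbers, which is the kind of difference a heat map makes visible at a glance.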
25:15 Yeah, I'm kind of on the fence about this, but it's pretty.
25:19 So yeah, same, but the logger one looks amazing.
25:24 I thought Loguru was also like a general logging tool.
25:26 Like it does more, I think, than just things for debugging.
25:31 - Yeah, I think it's a general logging tool as well.
25:34 - Okay. - Okay.
25:35 - But I guess it logs errors really good.
25:37 (Brian laughs)
25:40 - So anyway-- - The logger.catch decorator.
25:42 Okay, could probably do other things with the logger then as well, but having a good logging debugger catcher is always welcome.
25:49 - Yeah, absolutely.
25:50 All right, let's talk about ducks.
25:52 I mean, Brian, you and I are in Oregon.
25:54 - Go Ducks. - Go Ducks.
25:55 Is that a, well, I know your daughter goes there. My daughter goes to OSU, so go Beavs, I guess.
26:01 Whatever, we're gonna talk duck databases anyway.
26:03 And data science.
26:04 So Alex Monahan sent over to us saying, "Hey, you should check out this article about DuckDB," which is a thing I'm now learning about.
26:13 And its integration, its direct integration with pandas.
26:16 So instead of taking data from a database, load it into a pandas data frame, doing stuff on it, and then getting the answer out, you basically put it into this embedded database, DuckDB, which is SQLite-like, and then, sorry, you put it into a pandas data frame, but then the query engine of DuckDB can query it directly without any data exchange, without transferring it back and forth between the two systems or formats.
26:41 That's pretty cool, right?
26:42 So, let me pull this.
26:44 - Oh, that's Hannes.
26:45 I know him.
26:46 - Nice.
26:47 - Yeah, he's from Amsterdam.
26:48 - Yeah, very cool.
26:49 So here's the idea.
26:50 We've got SQL on pandas, basically.
26:54 So if we had a data frame, here they have a really simple data frame, but just a, you know, a single array, but it could be a very complex data frame.
27:03 And then what you can do is you can import duckdb and you can say duckdb.query, and then you write something like, so one of the columns is called a in the data frame, and you could say select sum of a from the data frame.
27:17 How cool is that?
27:18 - I don't know.
27:19 Is it cool?
27:20 - It's very cool.
27:21 Then there's also a to_df, to data frame, on the result.
27:24 So what happens here is this is parsed by DuckDB, which has an advanced query optimizer for things like joins and filtering and indexes and all that kind of stuff.
27:37 And then it says, oh, okay.
27:39 So you said there's a thing called mydf, which I'll just go look for in the locals
27:44 of my current call stack and see if I can find that.
27:47 Oh yeah, that is neat.
27:49 So you can write arbitrary SQL.
27:51 And this one looks pretty straightforward.
27:52 You're like, yeah, yeah, okay, interesting, interesting.
27:54 But you can come down here and do more interesting things.
27:58 Let's see, I'll pull up some examples.
28:00 So they do a select aggregation group by thing.
28:04 So select these two things, and then also do a sum min max and average on some part of the data frame.
28:11 And then you pull it out of the data frame and you group by two of the elements, right?
28:16 And they show also what that would look like if you did that in true pandas format.
28:20 That's cool.
28:21 And they say, well, it's about two to three times faster in the DuckDB version.
28:26 - That is interesting.
28:27 - That's interesting, right?
28:28 But then they say, well, what if we wanted not to just group by, but we wanted a filter?
28:33 Seems real simple, like where the ship date is less than 1998, no big deal.
28:38 But because of the way this can be really efficiently figured out by the query optimizer, it turns out to be much faster.
28:46 So 0.6 seconds on single threaded or it actually supports parallel execution as well.
28:52 So multi-threaded, they tested on a system that only had two cores, but it can be many, many cores.
28:58 So it's faster 0.4 seconds when threaded versus 2.2 seconds, sorry, 3.5 seconds on regular pandas.
29:06 But there's this more complicated non-obvious thing you can do called a manual pushdown in pandas, which will help drive some of the efficiency before other work happens.
29:16 And then they finally show one at the very end where there's more stuff going on that Query Optimizer does.
29:22 So the threaded one's 0.5 seconds, regular pandas is 15 seconds.
29:26 So all that's cool and what's really neat is it all just happens like on the data frame.
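One plausible reading of the manual pushdown mentioned above is to keep only the columns a query needs before filtering and grouping, so pandas copies less data. The sketch below assumes pandas is installed; the frame orders, its columns, and both function names are hypothetical.

```python
import pandas as pd

# Hypothetical data; the "comment" column is never needed by the query.
orders = pd.DataFrame({
    "ship_year": [1996, 1997, 1998, 1999],
    "status": ["A", "B", "A", "B"],
    "price": [10.0, 20.0, 30.0, 40.0],
    "comment": ["w", "x", "y", "z"],
})


def filter_then_aggregate(df):
    # Filtering the full frame copies every column, including unused ones.
    filtered = df[df["ship_year"] < 1998]
    return filtered.groupby("status")["price"].sum()


def manual_pushdown(df):
    # Select only the needed columns first, then filter, then aggregate.
    needed = df[["ship_year", "status", "price"]]
    filtered = needed[needed["ship_year"] < 1998]
    return filtered.groupby("status")["price"].sum()


print(filter_then_aggregate(orders))
print(manual_pushdown(orders))
```

Both functions return the same totals. On a wide frame the second does less copying, which is the kind of reordering a query optimizer can apply without being asked.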
29:30 - Yeah, there's two things about that that are pretty interesting.
29:32 Like one is we shouldn't underestimate how many people are still new to pandas but do understand SQL.
29:37 So just for that use case, I can imagine, you know, you're gonna get a lot of people on board.
29:42 But the fact that there's a Query Optimizer in there that's able to work on top of pandas, that's also pretty neat.
29:47 'Cause I'm assuming it's doing clever things like, oh, I need to filter data, I should do that as early on as possible in my query plan, it's doing some of that logic internally.
29:55 And the fact is you can parallelize it, 'cause pandas doesn't parallelize easily.
30:00 It's also something--
30:00 - Yeah, I don't know that it parallelizes at all.
30:02 You gotta go to something like Dask.
30:04 - Yeah, I mean, so there are some tricks that you could do, but they're tricks, they're not really natively supported.
30:10 - Right, right.
30:12 But just having a SQL interface is neat.
30:14 - Yeah, yeah, this is pretty neat.
30:16 And also now I learned about DuckDB.
30:17 So apparently that's a thing, which is pretty awesome.
30:21 So it's in process, just like SQLite, and it's written in C++11 with no dependencies.
30:28 Supposed to be super fast.
30:29 So this is also a cool thing that maybe I'll check out, unrelated to querying pandas, but the fact that you can, I think, is pretty cool.
30:35 - It's got a great name.
30:37 - Yeah, you know, another database out there that I hear a lot about, but I've never used, that I really have an opinion about is CockroachDB.
30:45 I'm not a huge fan of just on the name, although it has some interesting ideas.
30:49 I think it's like meant to communicate resiliency and it can't be killed 'cause it's like geo-located and it's just gonna survive, but yeah.
30:56 Ducks, I'll go with ducks.
30:57 - Yeah, I would agree.
30:59 - Yeah, and then out in the live stream chat, Christopher asks, so is DuckDB working on pandas data frames, or can you load the data and method chain with DuckDB and reduce memory?
31:09 I believe you could do either.
31:11 Like you could load data into it and then there's a to_df option.
31:15 That probably could come out of it, but--
31:17 - I think just very briefly--
31:18 - It's basically right on it.
31:20 Yeah, go ahead.
31:20 - It doesn't, I might've just seen it briefly while you were scrolling in the blog post, but I believe it also said that it supports the Parquet file format.
31:27 - It does.
31:28 - So the nice thing about Parquet is you can kind of index your data cleverly.
31:32 Like you can index it by date on the file system.
31:34 And then presumably if you were to write the SQL query in DuckDB, it would only read the files of the appropriate date if you put a filter in there.
31:41 So I can imagine just because of that reason, DuckDB on its own might be more memory performant than pandas, I guess.
31:47 - Yeah, perhaps.
31:49 - There's stuff like that you could do.
31:51 - Yeah, and then Nick Harvey also says, I wonder if it's read-only, if you can insert or update.
31:56 I don't know for sure, but you can see in some of the places they are doing projections.
32:02 So for example, they're doing a select sum, min, max, average,
32:06 that's generating data that goes into it.
32:08 And then the result is a data frame.
32:10 So you can just add into the data frame afterwards if you want to be more manual about it.
32:14 Yeah. All right.
32:15 Vincent, you got the last one?
32:16 - Yeah. So the thing is I work for a company called Rasa.
32:20 We make software with Python to make virtual assistants easier to make in Python.
32:26 And I was looking in our community showcase and I just found this project that just made me kind of feel hopeful.
32:31 So this is a personal project, I think.
32:34 So we have a name here, Amit, and I hope I'm pronouncing it correctly, Arvind.
32:39 But what they did is they used Rasa kind of like a Lego brick, but they made this assistant, if you will, that you can send a text message to.
32:46 Now, what it does, I'll zoom in a little bit for people on YouTube that might be able to see the GIF, but every 10 minutes, it scrapes the weather information, the fire hazard information, and I think evacuation information from local government in California, meant to help people during wildfire season.
33:02 And they completely open-sourced this project as well.
33:05 So there's a linked GitHub project where you can just see how they implemented it.
33:09 And it's a fairly simple implementation.
33:12 They use Rasa with a Twilio API.
33:14 They're doing some neat little clever things here with, like, if you misspelled your city, they're using a fuzzy string matching library to make sure that even if you misspell your city, they can still try to give you accurate information.
33:26 But what they do is they just have this endpoint where you can send a text message to, like, "Give me the update of San Francisco." And then it will tell you all the weather information, air quality information, and that sort of thing.
33:36 And if you need to evacuate, it will also be able to tell you that.
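The misspelled-city handling can be sketched with Python's standard library. This uses difflib as a stand-in and is not the library or the code the project itself uses; the city list, the cutoff value, and the function name match_city are assumptions.

```python
import difflib

# Hypothetical list of cities the assistant knows about.
KNOWN_CITIES = ["San Francisco", "San Diego", "Sacramento", "Santa Rosa"]


def match_city(text, cities=KNOWN_CITIES, cutoff=0.8):
    """Return the known city closest to the typed text, or None if nothing is close."""
    # Compare in lower case so capitalization does not affect the score.
    lookup = {city.lower(): city for city in cities}
    hits = difflib.get_close_matches(
        text.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


print(match_city("San Fransisco"))
print(match_city("Tokyo"))
```

A higher cutoff makes matching stricter. Returning None for unknown input lets the assistant ask the user to try again instead of guessing.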
33:39 And what I just loved about this, if you look at the way that they described it, this was just two people who knew Python who were a little bit disappointed with the communication that was happening, but because the APIs were open, they just built their own solution.
33:53 And like thousands of people use this.
33:55 And what's even greater is that, you know, if your mobile coverage isn't great, watching a YouTube video or like trying to get audio in can be tricky, but a text message is really low bandwidth.
34:05 So for a lot of people, this is like a great way to communicate.
34:08 And of course I'm a little bit biased 'cause I work for Rasa and I think it's awesome that they use Rasa to build this.
34:13 But again, the whole thing is just open source.
34:16 You can go to their GitHub and you can just, if I'm not mistaken, there's like the scraping job of the endpoints actually in here as well.
34:25 But this is like exactly what you want.
34:27 Just a couple of open APIs and sort of citizen science, building something that's useful for the community. It's great.
34:31 >> Yeah, I like it.
34:33 Text message is probably a really good way to communicate for disasters.
34:36 >> Yes.
34:37 >> Possibly in a place where LTE is fresh, Wi-Fi is out.
34:43 Even if you're on EDGE, text should still get there.
34:47 >> Exactly.
34:47 >> Unless you're on iMessage, then you're out of luck. No, I don't know.
34:51 >> I live in Europe, so I cannot comment on that, of course, but it's a little bit different here.
34:57 But no, the data service, you can just look in here. And this is like, again, I like these little projects that don't need anyone's permission to help people like that stuff. Like, this is good stuff.
35:07 And the thing that I also really like about it is, it's really just sending you a text message with like air quality information, like enough information. And that's good. It's not like they're trying to make like a giant predictive model on top of this or anything like that. They're just really doing enough and enough is plenty. Like that's the thing I really love about this little demo. And of course, using Rasa, which is great. But this is the kind of stuff, this is why I get up in the morning: projects like this.
35:32 That's fantastic. Yeah, I love it. That's a really good one.
35:36 Brian, is that it?
35:37 Yeah, that's it. That's our six items. Any extras that you want to talk about?
35:43 I might have one. Okay. Yeah. Okay. So I'm totally tooting my own horn here. But this is a project I made a little while ago, but I think people might like it.
35:54 So at some point, it kind of struck me that people were making these machine learning algorithms and they're trying to like on a two dimensional plane, trying to separate the green dots from the red ones from the blue ones.
36:04 And I just started wondering, well, why do you need an algorithm if you can just maybe draw one?
36:10 So very typically you got these like clusters of red points and clusters of blue points.
36:14 And I just started wondering, maybe all we need is like this little user interface element that you can load from a Jupyter notebook.
36:21 And maybe once you've made a drawing, it'd be nice if we can just turn it into a scikit-learn model.
36:25 So there's this project called Human Learn that does exactly this.
36:30 It's a tool of little buttons and widgets that I've made to just make it easier for you to do your domain knowledge thing and turn it into a model.
36:38 So one of the things that it currently features is the ability to draw a model, which is great because domain experts can just sort of put their knowledge in here.
36:46 It can do outlier detection as well 'cause if a point falls outside of one of your drawn circles that also means that it's probably an outlier.
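The drawing idea can be sketched without any library: treat each drawn shape as a polygon, label a point by the shape that contains it, and call it an outlier when no shape does. This is a standard-library illustration of the concept, not Human Learn's API; the function names, labels, and coordinates are made up.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count the polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def classify(point, drawn_shapes):
    """Return the label of the first drawn shape containing the point, else 'outlier'."""
    for label, polygon in drawn_shapes.items():
        if point_in_polygon(point[0], point[1], polygon):
            return label
    return "outlier"


# Two hypothetical hand-drawn regions, given as lists of corner points.
drawn_shapes = {
    "red": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "blue": [(6, 6), (10, 6), (10, 10), (6, 10)],
}

for point in [(1, 1), (8, 7), (5, 5)]:
    print(point, classify(point, drawn_shapes))
```

The same containment check covers both uses mentioned here: classification when a point lands in a shape, and outlier detection when it lands in none.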
36:52 But it also has a tool in there that allows you to turn any Python function, like any custom Python-written function, into a Scikit-learn-compatible tool as well.
37:01 So if you can just declare logic in a Python function, that can also just be a machine learning model from now on.
37:07 There's an extra fancy thing, if people are interested.
37:09 I just made a little blog post about that, where I'm using a very advanced coloring technique using parallel coordinates.
37:18 Very fancy technique.
37:20 Won't go into too much depth there, but what's really cool is that you can basically show that a drawn model can outperform the model that's on the Keras Deep Learning blog, which I just thought was a very cool little feature as well.
37:33 The project's called Human Learn.
37:35 It's just components for inside of your Jupyter Notebook to make domain knowledge and human learning and all that good stuff better.
37:43 Also, with the fairness thing in mind, I really like the idea that people sort of can do the exploratory data analysis bit, and at the same time also work on their first machine learning models and benchmark.
37:53 That's what HumanLearn does.
37:54 So if people are sort of curious to play around with that, please do.
37:58 It's open source, pip install it, please use it.
38:00 - I'm impressed, this is cool.
38:01 That's really cool. - It is, right?
38:02 - Yeah.
38:03 - Maddy out in the livestream asks, "How does it handle ND data?" Or N, I guess it's three or larger.
38:09 - Yeah, so you can make, like, so if you have four columns, you can make two charts with two dimensions.
38:15 That's one way of dealing with it.
38:16 There's a little trick where you can combine all of your drawings into one thing.
38:19 If you go to the examples, though, the parallel coordinates chart that you see here, that has 30 columns and it works just fine. I do think 30 is probably the limit.
38:27 But the parallel coordinates chart, you can make a subselection across multiple dimensions. That just works. It's really hard to explain a parallel coordinates chart on a podcast, though. I'm sorry. Yeah, so this is a super interactive visualization thing with lots of colors and stuff happening. I'm sorry, you have to go to the docs to fully experience that. But again, also, like if you, let's say you work for a fraud office and someone asks you like, "Hey, without looking at any data, can you come up with rules that's probably fraud?" And you can kind of go, "Yeah, if you're 12 and you earn over a million dollars, that's probably weird. Someone should just look at that." And the thing is, you can just write down rules that way. And that should already be, can already be turned into a machine learning model. You don't always need data. And that's the thing I'm trying to cover here. Like just make it easier for you to declare stuff like that.
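A rule like the fraud example can be wrapped in an object that follows the familiar fit and predict convention. The sketch below is plain Python and is neither Human Learn's nor scikit-learn's actual API; the class RuleClassifier, the function looks_like_fraud, and the row fields age and income are hypothetical.

```python
class RuleClassifier:
    """Wrap a plain Python rule so it can be used like a model with fit and predict."""

    def __init__(self, rule):
        self.rule = rule

    def fit(self, X, y=None):
        # Nothing is learned from data; the rule already encodes the knowledge.
        return self

    def predict(self, X):
        # Apply the rule to every row and return 1 (flag) or 0 (no flag).
        return [int(self.rule(row)) for row in X]


def looks_like_fraud(row):
    # The rule from the discussion: very young and a very high income.
    return row["age"] <= 12 and row["income"] > 1_000_000


rows = [
    {"age": 12, "income": 2_000_000},
    {"age": 12, "income": 50},
    {"age": 45, "income": 2_000_000},
]

model = RuleClassifier(looks_like_fraud).fit(rows)
print(model.predict(rows))
```

Because no training data is needed, a rule like this can serve as a first benchmark before any learned model exists.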
39:14 It's a more human approach.
39:16 Brian, I cut you off. Were you going to say something?
39:18 Oh, one of the things, I don't know if we've covered this already, but we've talked about calmcode.io a lot on this podcast.
39:26 And you're the person behind it, right? Yeah, I am.
39:30 It's been a fun little side project that I've been doing for a year now.
39:34 So, nice videos. I like how short they are. Thanks.
39:38 And that's also the thing that I was kind of going for.
39:42 Like I love the, you know when you watch a video, it's like a lightning talk, and you learn something in five minutes?
39:48 - Yeah.
39:49 - Oh, that's an amazing feeling.
39:50 That's the thing I'm trying to capture there a little bit.
39:52 Like if it takes more than five minutes to get a point across, then I should go on to a different topic.
39:58 But I'm happy to hear you like it.
39:59 - Cool. - Yeah, very cool.
40:00 - How about you, Michael?
40:01 Anything extras?
40:02 - Well, I had two, now I have three, because I was reading the source code of one of Vincent's projects there as we were talking, and I learned about FuzzyWuzzy.
40:13 (laughing)
40:16 So fuzzy wuzzy was being used in that emergency disaster recovery awareness thing.
40:21 And it's fuzzy string matching in Python and it says fuzzy string matching like a boss.
40:27 Which you gotta love.
40:28 So it was like slight misspellings and plural versus not plural and whatnot.
40:32 And Brian, it even uses Hypothesis, which is kind of interesting.
40:36 - Yeah, and pytest.
40:38 - Yeah, and pytest of course.
40:39 Anyway, that's pretty cool, I just discovered that.
40:42 - So, Fuzzy Wuzzy is a pretty cool tool.
40:45 The only thing I don't like about it, and it's the one thing I do have to mention, it is my understanding that Fuzzy Wuzzy is a slur in certain regions of the world.
40:53 So in terms of naming a package, they could have done better there, but I think they only realized that in hindsight.
40:57 Other than that, there's some cool stuff in there, definitely just, when I learned about this, I did make the comment to myself, like, okay, I should always acknowledge it whenever I talk about the package.
41:05 But yeah, it's definitely useful stuff in there.
41:08 Fuzzy string matching is a useful problem to have a tool for.
41:10 Yeah, very cool.
41:11 And PyCon, way out in the future, 2024-2025 announcement is out.
41:18 So the next two PyCons are already theoretically in Salt Lake City.
41:23 So hopefully we actually go to Salt Lake City and not just go and we'll virtually imagine it was there, right?
41:29 Like this year.
41:30 But last two years because of the pandemic, Pittsburgh lost its opportunity to have PyCon, so not just once but twice.
41:38 So they are rescheduling the next one back into Pittsburgh.
41:41 So folks there will be able to go and be part of PyCon.
41:44 - That's pretty cool. - Wow.
41:46 - Because of Corona, they've now been able to plan four years ahead of the way.
41:49 - Exactly.
41:50 - It's beautiful.
41:51 - Everything's upside down now.
41:53 And then also, I just wanna give a quick shout out to an episode that I think is coming out this week on Talk Python, I'm pretty sure that's the schedule, called CodeCarbon.io.
42:02 And it is a, I'm gonna pull it up here.
42:05 It is both a dashboard that lets you look at the carbon generation, the CO2 footprint of your machine learning models, as you, specifically around the training of the models.
42:16 So what you do is you pip install somewhere in here, you pip install this emission tracker, and then you just say, start tracking, train, stop tracking.
42:25 And it uses your location, your data center, the local energy grid, the sources of energy from all that.
42:31 And it'll say like, oh, if you actually switch to say the Oregon AWS data center from Virginia, you'd be using more, you would be using more hydroelectric rather than, I don't know, gas or whatever, right?
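The start, train, stop pattern described here might look roughly like this. The helper run_tracked and the stand-in train function are hypothetical; main assumes the codecarbon package is installed and uses its EmissionsTracker class, whose stop method returns an emissions estimate.

```python
def run_tracked(train, tracker):
    """Run train() between tracker.start() and tracker.stop()."""
    tracker.start()
    try:
        result = train()
    finally:
        # stop() runs even if training raises, so tracking always ends.
        emissions = tracker.stop()
    return result, emissions


def main():
    # Imported here so the helper above can be used without the package.
    from codecarbon import EmissionsTracker

    def train():
        # Stand-in for real model training.
        return sum(i * i for i in range(1_000_000))

    result, emissions = run_tracked(train, EmissionsTracker())
    print(f"estimated emissions: {emissions} kg CO2eq")
```

Calling main() runs the stand-in workload under the tracker. Since run_tracked only needs an object with start and stop, the same wrapper could surround a long test run as easily as a training job.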
42:44 So just we were talking about some of the ethics and cool things that we should be paying attention to.
42:48 And I feel like the sort of energy impact of model training might be worth looking at as well.
42:53 - So I totally agree with model training.
42:56 I've been wondering about this other thing though, and that's testing on GitHub.
42:59 Like if you think about some of these CI pipelines, they can be big too.
43:02 Like I've heard projects that take like an hour on every commit.
43:05 I'd be curious to run this on that stuff as well.
43:08 - Yeah, well, you could turn on, you could employ this as part of your CI/CD.
43:14 It doesn't really have to do with model training per se, but it does things like when you train models that use a GPU, it'll actually ask the GPU for the electrical current.
43:24 - Ah, right.
43:25 - Right, so it goes down into the hardware.
43:27 - That's a fancy feature.
43:28 That's a fancy feature.
43:29 And it goes down to the CPU level, the CPU level voltage, and all sorts of low, it's not just, well, it ran for this long, so it's this, right?
43:37 - Ah, okay. - It's really detailed.
43:39 That said, I suspect you could actually answer the same question on a CI, right?
43:44 It would just say, well, it looks like you're training on a CPU.
43:46 (both laughing)
43:48 - Yeah, true.
43:49 But so, yeah, it's a nice way to be conscious about compute times and stuff, so that's--
43:53 - Yeah, and what's cool is it has the dashboard that actually lets you explore.
43:58 Well, if I were to shift it to Europe rather than train in the U.S., which, who really cares where it trains.
44:02 Would that, what difference would that have?
44:04 Look at how green Paraguay is.
44:06 We are hosting.
44:07 Yeah.
44:08 That's incredible.
44:09 I suspect a lot of waterfalls.
44:11 Yeah.
44:12 Countries down there have insane amounts of hydro,
44:15 like Chile, maybe, I can't remember exactly, but yeah, a lot of hydro, and you see Iceland as well.
44:20 And it's probably because of the volcanoes and warmth and heat.
44:23 And yeah.
44:23 Yeah.
44:23 The to, yeah.
44:24 Okay.
44:24 Interesting.
44:25 All right.
44:25 Nice.
44:26 Brian, you got anything?
44:27 - No, not this week.
44:29 - How about we do a joke?
44:30 - Sounds good.
44:31 - So, it's been a while since I've been to a strongest man competition.
44:36 World's strongest man.
44:37 You know, like, maybe one of those things where you pick up like a telephone pole and you have to carry this, throw it as far as you can, or you lift like the heaviest barbells, or like you carry huge rocks some distance.
44:47 So here's one of those things.
44:48 There's like three judges, bunch of people who look way over pumped.
44:53 They're all flexing, getting ready.
44:55 The first one is this person carrying a huge rock, sweating clearly, and the judges are, they're not super impressed.
45:02 They give a five, a two, and a six.
45:04 Then there's another one lifting this 500-pound barbell over his head, says eight, seven, and six is their score.
45:10 And then there's this particularly not overly strong-looking person here, says, "I don't use Google when coding." Wow, so strong.
45:19 The judges give him straight 10s.
45:21 (laughs)
45:22 - And he's also being really sincere.
45:24 hand over his heart. Oh yeah, like it's very humble. Yeah, exactly.
45:29 All right, well, that's what I got for you.
45:33 Take it. Take it for what you will.
45:34 That's pretty good. Just Stack Overflow.
45:37 Yeah. Yeah. Well, I feel like Stack Overflow would take it to 11.
45:41 Honestly, I don't use Stack Overflow now.
45:45 Yeah. Winner.
45:46 Definitely. That's funny.
45:49 Well, thanks for that.
45:51 You're usually pretty good about finding our jokes.
45:53 So I appreciate it.
45:54 And thanks, Vincent, for coming on the show.
45:57 - Thanks for having me.
45:59 It was fun.
46:00 - I think that's a wrap.
46:01 - Yeah, that is.
46:02 Thanks, Brian.
46:03 - Thank you. - Bye, Vincent.