

Transcript #238: A cloud-based file system for Python and a new GUI!

Recorded on Tuesday, Jun 15, 2021.

00:00 Hello and welcome to Python Bytes where we deliver Python news and headlines directly to your earbuds.

00:04 This is episode 238 recorded June 15th, 2021.

00:09 I'm Michael Kennedy.

00:10 And I'm Brian Okken.

00:11 I'm Julia Signel.

00:12 Hey, Julia. Thanks for coming on the show.

00:14 Yeah, thanks for having me.

00:16 Yeah, it's great. Why don't you tell folks a bit about yourself?

00:19 Yeah, so I'm the head of open source at Saturn Cloud and a maintainer of Dask.

00:24 So I split my time half and half.

00:26 I spend half my time just doing regular maintenance-y stuff on Dask, and then half my time doing engineering and product management on Saturn Cloud.

00:36 Saturn Cloud is a data science platform that really specializes in distributed Dask clusters and Jupyter, and making it really easy for people to get up and going with those things on AWS.

00:47 - Yeah, Dask is really interesting.

00:50 When I first heard about it, I thought, okay, this is a grid computing scale-out thing, which I probably don't have a lot of use for, but then I was speaking with Matthew Rocklin about it, and it has a lot of applicability, even if you have not huge data, huge clusters, right?

01:05 Like you can say, even on your local machine, scale this out across my cores, or allow me to work with more data than will fit in RAM on my laptop, and stuff like that, right?

01:14 It's a cool idea.

01:15 - Yeah, yeah, it has like a whole number of different ways of interacting with it, right?

01:19 Like there's that, there's like, just make this thing go faster by parallelizing it, there's all the data framey stuff, there's all the array stuff for more dimensional data.

01:28 So it's got a large API.

01:31 - Yeah, cool.

01:32 And we're gonna touch on a couple of topics that are not all that unrelated to those things here.

01:37 And so, yeah, speaking of data science, Brian, you wanna kick us off?

01:41 - Sure.

01:43 Yeah, the first thing I wanna cover is an article called "Practical SQL for Data Analysis." This is by Haki Benita.

01:51 So one of the things I liked about this is it was kind of talking about, the first bit of the article was talking about basically that with data science, you've got pandas and NumPy and stuff.

02:06 And often you're also dealing with a database, and SQL on the back end.

02:12 So the first part of the article talks about how, for some things you can do both in pandas and in SQL, like SQL queries, it's faster in SQL.

02:24 So there's a big chunk that's just talking about how that's faster.

02:29 But then he also talks about just basically there's a lot of benefits to the flexibility and the comfortableness you can have with pandas though.

02:39 So there are trade-offs as to where you're gonna push it too far into SQL, and having a split is good.

02:47 But then he goes through and talks about a whole bunch of great examples of different things like pivot tables and roll-ups and choices and different things you can do with either Pandas or SQL and really what his recommendations are for whether it should be in Pandas or in SQL query and then how to do those queries.

03:07 Because I mean really the gist of the article and this problem space is people are comfortable with pandas, but they don't really understand SQL queries.

03:17 So this sort of good cheat sheet for how to do the queries is, I think, really kind of a cool thing.

03:24 Yeah, I think it's really neat.

03:27 And you have these problems, you know how to solve them in one or the other.

03:30 And I think this compare and contrast is really valuable.

03:33 Like I know how to take the mean of some column in SQL, but I haven't done it in pandas yet.

03:39 Let's go see how to do that.

03:40 Or I'm really good at doing pivot tables in pandas, but boy, I always kind of avoided joins in SQL.

03:45 They scared me.

03:46 And then how does that even translate, right?

03:48 I think that back and forth is really valuable.

03:49 - Yeah, yep.

03:51 And then it covers things that I don't even know what they are, like aggregate expressions.

03:55 I don't even know what that is, but apparently that's a thing that people do.
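
For readers in the same spot, here is a minimal sketch of one common reading of "aggregate expressions": summary values such as a count, a mean, or a conditional total that the database computes per group. It uses Python's built-in sqlite3 module rather than the database from the article, and the table, column names, and rows are invented for illustration.

```python
import sqlite3

# Hypothetical sales rows, invented for this sketch.
ROWS = [
    ("north", "widget", 10.0),
    ("north", "gadget", 30.0),
    ("south", "widget", 20.0),
    ("south", "gadget", 40.0),
    ("south", "gadget", 60.0),
]


def summarize(rows):
    """Return one summary row per region, computed inside the database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        query = """
            SELECT
                region,
                COUNT(*) AS n_sales,
                AVG(amount) AS mean_amount,
                -- Conditional aggregate: only gadget rows add to this total.
                SUM(CASE WHEN product = 'gadget' THEN amount ELSE 0 END) AS gadget_total
            FROM sales
            GROUP BY region
            ORDER BY region
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


summary = summarize(ROWS)
for region, n_sales, mean_amount, gadget_total in summary:
    print(region, n_sales, mean_amount, gadget_total)
```

Only the small summary comes back to Python; the raw rows never leave the database, which is the trade-off the article is weighing against doing the same work in pandas.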

03:59 - I can help you out at aggregate stuff.

04:01 No, just kidding.

04:03 Julia, what do you think of this?

04:04 - Yeah, no, it seems, it's really cool.

04:05 Like, I agree that having it in pandas and then in SQL, that comparison is super helpful.

04:12 Like SQL is always super scary to me and I always end up like Googling a bunch of stuff whenever I have to mangle my SQL.

04:19 But I know it's so fast, so it's cool to see a way to access that.

04:23 - Yeah, absolutely.

04:24 This is a good one, Brian.

04:25 I think a lot of people will find it useful.

04:26 I also wanna just give a quick shout out to the past a little bit.

04:30 Not too long ago, we talked about an efficient SQL on pandas with DuckDB where you actually do the SQL queries against pandas data frames.

04:39 So if you're finding that you're trying to do something and maybe it would be better in SQL, but you don't wanna, say, completely switch all your data over to a relational database, you just kinda wanna stay on the pandas side, but there's that one or two things, this is really cool. This sort of upgrade-your-data-frame-to-execute-SQL with the DuckDB query optimizer is also kind of a nice intermediary there.

05:04 - Yeah, Dask also does some, I'm gonna try not to make everything about Dask, but Dask does some things that are kind of, that kind of take some of the ideas from this article of like doing predicate pushdown of like, of pushing down some of the like filters into the read because it evaluates lazily.

05:22 It doesn't have to like grab all the data greedily up front.

05:24 It can like do that later.

05:26 So you can get some of the benefits.

05:28 - That's cool.

05:29 And it can also distribute the filter bit, I guess at that point.

05:32 Yeah, nice.

05:33 All right, I wanna talk about the usual suspects.

05:36 So, okay, that was a pretty good show.

05:38 Was that Quentin Tarantino or something like that?

05:41 It's not actually about this.

05:42 This comes to us from Ruslan Portnoy.

05:45 And thank you for sending this in.

05:46 Mentioned an article that has this really interesting idea.

05:50 How do you apply git blame when you encounter a Python traceback?

05:56 So here's the scenario.

05:57 Your code crashes and you either print out the traceback or Python does it for you because it's just crashed.

06:03 And normally it says, here's the value, here's the line of code, here's the file it's in, here's the next line in the call stack, here's the line of code it's in.

06:12 The idea is you can take git blame, which is a command that says, show me who changed this line of code or who wrote this line of code, at least touched it last on every single line of code.

06:24 And I love this whole idea of like, all right, who did this?

06:26 And sometimes I'll come across code.

06:28 I'm like, this is so crappy, like who did this?

06:30 Oh, wait, that's me.

06:31 Okay, well, at least I know how I would feel about it.

06:35 But the idea is, what if your traceback, on each line where it had an exception, could also show who wrote that line of code?

06:42 Cool, huh?

06:43 - Yeah, that'd be great.

06:43 - Yeah, so let's check it out.

06:45 It's pretty straightforward.

06:46 This is an article by Ofer Koren, and it basically uses two libraries that are themselves both pretty straightforward.

06:54 So like here's a straightforward example of a traceback, like trying to pop something off of an empty list.

07:00 It says on this line in the function popSum, you know, there's this line here in the call stack and then the next line, this line in the call stack and eventually raise a value error, you know, empty range, can't pop nothing off, you know, something off of nothing, basically.

07:14 But this doesn't show you any information about like maybe who wrote that line and who wrote this other line up here, right?

07:19 So what they did is they took a couple of modules, traceback and then linecache.

07:26 And it turns out when traceback shows you this traceback, that line, it uses linecache to figure out, okay, from this actual, I'm guessing, bytecode that it's gonna run, this CPython interpreter code, what line of what file did this actually come from, right?

07:45 So here's the insight or the thing.

07:48 You can actually change what's in the cache.

07:51 And because it's a cache, once it's figured out what the lines are, it's not gonna read it again.

07:56 So it's like a list for each line that you get back,

08:00 and you can just change the value.

08:02 So it said, okay, well, here's like return random.

08:05 That's what the line of text was.

08:06 They're like, no, no, no, there's nothing to see here, move along.

08:09 If you make that change and then you cause it to crash again, what comes out is, if you go a little bit further down, normal traceback, normal traceback, and then, instead of the line of code, it says nothing to see here, please move along.
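
The cache trick described here can be sketched with the standard library alone. This is a minimal illustration, not the article's code: the module source, the function name `pop_last`, and the temporary file are all invented so that linecache has a real file to read.

```python
import linecache
import os
import tempfile
import traceback

# Hypothetical module source, written to a temporary file so that
# linecache has something on disk to load.
SOURCE = "def pop_last(items):\n    return items.pop()\n"


def traceback_with_patched_line():
    """Crash inside pop_last() and return the traceback text after editing linecache."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buggy.py")
        with open(path, "w") as handle:
            handle.write(SOURCE)

        # Compile with the real file name so frames point at buggy.py.
        namespace = {}
        exec(compile(SOURCE, path, "exec"), namespace)

        # getlines() fills the cache and returns the cached list itself,
        # so editing that list in place changes what tracebacks display.
        lines = linecache.getlines(path)
        lines[1] = "    nothing to see here, please move along\n"

        try:
            namespace["pop_last"]([])
        except IndexError:
            # Format while the file still exists, so the cache entry stays valid.
            return traceback.format_exc()
    return ""


print(traceback_with_patched_line())
```

The file on disk is never modified; only the cached copy of its lines is, which is why the replacement text survives until the cache entry is invalidated.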

08:21 All right, so what are you gonna do with that, now that you realize you can actually change what appears in the traceback?

08:28 So you write a little regular expression to go and execute git blame on the various files, and then re-inject that back into linecache.

08:38 And so what they do is, if they know the blame, they just put, you know, up to 80 characters of the line, and then edited on such and such date by such and such person, and here's the commit message, right?

08:53 And so, just basically shelling out to git blame when it crashes, now you get some really cool stuff. Like on this slide, it says this was edited many, many days ago by so-and-so in this Git commit, and so on.
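
A rough sketch of the blame step follows. It is not the article's implementation: instead of a regular expression it parses `git blame --porcelain` output, and the function names, the sample text, and the annotation format are assumptions made for illustration. `blame_line` needs git and a committed file, so the demo below only exercises the parsing on a hand-written sample.

```python
import subprocess
from datetime import datetime, timezone

# Hand-written sample in the shape of `git blame --porcelain` output.
SAMPLE = (
    "0123456789abcdef0123456789abcdef01234567 2 2 1\n"
    "author Ada Example\n"
    "author-mail <ada@example.com>\n"
    "author-time 1623715200\n"
    "author-tz +0000\n"
    "summary Add pop_last helper\n"
    "filename buggy.py\n"
    "\t    return items.pop()\n"
)


def parse_porcelain(text):
    """Pull author, timestamp, summary and code out of porcelain output for one line."""
    info = {}
    for row in text.splitlines():
        if row.startswith("\t"):
            # The source line itself is prefixed with a single tab.
            info["code"] = row[1:]
        elif " " in row:
            key, value = row.split(" ", 1)
            if key in ("author", "author-time", "summary"):
                info[key] = value
    return info


def annotate(info, width=80):
    """Build replacement text: up to `width` characters of code, then who, when, why."""
    when = datetime.fromtimestamp(int(info["author-time"]), tz=timezone.utc)
    code = info["code"].rstrip()[:width]
    return f"{code}  # {info['author']}, {when:%Y-%m-%d}: {info['summary']}"


def blame_line(path, lineno):
    """Run git blame for one line of a file; requires git and a committed file."""
    result = subprocess.run(
        ["git", "blame", "--porcelain", "-L", f"{lineno},{lineno}", path],
        capture_output=True, text=True, check=True,
    )
    return annotate(parse_porcelain(result.stdout))


annotated = annotate(parse_porcelain(SAMPLE))
print(annotated)
```

The annotated string is what would be written back into the cached line list so that it shows up in the next traceback.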

09:06 What's interesting, this is already in itself useful, I think.

09:10 But what's more interesting is other tools use this as well.

09:13 For example, if you use PuDB, which is a visual debugger, it's like a command line one, visual in the sense of like Emacs is visual, not like PyCharm is visual.

09:24 But it will actually pull up that data.

09:26 So you can see they jumped into the PuDB debugger and it's actually showing all this git blame attribution as well that they've added.

09:33 So yeah, pretty interesting.

09:35 What do you all think?

09:36 - Yeah, I think that looks really cool.

09:37 I mean, I always do git blame whenever I run into something that's weird with the hope that someone else will be able to explain it to me.

09:44 - Exactly, who knows about this or who do I talk to about breaking this?

09:47 - Right, yeah, you could even put like PR numbers and stuff in here, right?

09:51 And that'd be pretty cool.

09:52 - Oh, PR numbers, very cool.

09:53 - Yeah, that'd be super cool.

09:54 - Yeah, one of the things, I don't really like the name git blame, but it's there.

10:00 But I agree with Julia that the main thing I use it for isn't to try to figure out who broke it, but who to ask about this chunk of the code.

10:09 - I agree.

10:11 'Cause usually when you see something that's really confusing or weird, you're like, I know they didn't just pick the hard way of doing this because they didn't wanna do the easy way.

10:19 There's something that I don't fully understand, some edge case that's crazy here.

10:23 I'm gonna go talk to that person, so yeah.

10:25 - Also the how long ago it was edited.

10:28 So if there was something edited yesterday, that's probably the problem.

10:31 - Yeah, exactly.

10:32 Like in this little screenshot here, some of these are edited like 1,427 days ago.

10:38 That's probably not the problem, maybe, but probably not.

10:41 - I feel like I have the opposite assumption.

10:42 Like if something is from six years ago and it's weird, I'm like, well, probably things were different back then.

10:47 And like, you know.

10:49 - Yeah, yeah, it's no longer applicable to the new data, new situation.

10:53 Yeah.

10:54 - Oh, that'd be an interesting thing also is to have like a tool that would tell you if something's like over a thousand days old or something like that, you probably should go refactor it to make sure somebody understands that code.

11:05 - Yeah, yeah, for sure.

11:07 All right, jumping back to the first item really quick, in the live stream, Alexander out there, hey Alexander, says, "I wonder if graph databases with Gremlin queries could be more suitable for data science.

11:16 SQL joins are way harder." Yeah, graph databases are pretty interesting.

11:20 If you're trying to understand the relationships, that may well be better.

11:23 I don't know.

11:24 - So, Julia, do you got any thoughts on this?

11:25 - I don't know anything about graph databases.

11:29 So, out of my league.

11:30 - I didn't have a desire to understand graph databases until I found out that there were Gremlin queries.

11:36 Now I think I wanna know.

11:38 - Brian, they don't start out as Gremlin queries.

11:41 They're mogwai inserts.

11:43 And then if you insert them after midnight, then they become a Gremlin query.

11:47 I mean, come on, we all know how it goes.

11:49 You definitely don't wanna get them wet.

11:51 Oh, that's an old show.

11:53 I'm not sure if everyone's gonna get that reference, but yeah, I love that show.

11:56 Okay, anyway, let's move on to the next one.

11:59 The next one is you, Julia.

12:01 - Yeah, so I wanted to highlight fsspec.

12:05 So file system spec for people who can't hear letters very well.

12:08 So this is the basis for s3fs.

12:14 gcsfs, I'm not getting the letters right, but there's one for GCP, there's one for S3.

12:20 And basically it's a file system storage interface or like the basis for a file system.

12:27 And so you can do things like you can open just files as you can just take a path and open it as a file object in Python and read it with all the normal like read, write operations.

12:42 - Oh, interesting.

12:43 - But from anywhere.

12:44 So like there's all these different ones for S3, for GCS,

12:49 and even for HTTP, and just basically anything you can imagine. Anywhere you can imagine a file being, there's probably already been one of these written. It's an interface, and then you write different packages on top of it that are like drivers or something; they have some name for it.

13:12 And it allows you to treat the file system as like this interchangeable building block, so you don't end up writing like boto3 code or something that's very specific to a specific cloud storage.

13:26 You write like this more general code and then it's really useful for like a lot of free datasets that are hosted on different clouds, but like they'll sometimes be on one cloud and sometimes be on another, but like basically it's the same data.

13:38 Or if you're at a company and you wanna like switch clouds, it just makes that whole thing so much easier.

13:45 - It looks really, really useful, especially for avoiding cloud lock-in.

13:50 - Yeah, yeah.

13:51 And you can always write, like you can always write your own one.

13:54 If something else pops up, you can write your own implementation of that.

13:58 - Right, so there's an example here, talking about using a file system in the docs, that says something to the effect of, well, you want to open up a CSV and feed it off to pandas read_csv.

14:08 So normally you would say open the CSV file, and then you just say pandas read_csv and give it the file stream.

14:14 But what if that's on the internet?

14:16 What if that's on S3 with authentication?

14:18 What if that's somewhere else, right?

14:21 And so with this one, you can just say fsspec.open, here's a URL.

14:26 And now that's a stream, right?

14:28 Or that could be, here's an S3 location, S3 bucket.

14:32 Go get that, right?
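
As a small sketch of what that looks like in code, the example below uses `fsspec.open` with fsspec's built-in in-memory backend standing in for a remote store, and the standard csv module in place of pandas to keep dependencies down. The `read_rows` helper and the `memory://demo/data.csv` location are made up for illustration, and fsspec itself is a third-party package that has to be installed.

```python
import csv

import fsspec  # third-party: pip install fsspec


def read_rows(url, **storage_options):
    """Read CSV rows from any location fsspec knows how to open."""
    with fsspec.open(url, mode="r", **storage_options) as handle:
        return list(csv.reader(handle))


# Write a small file to the in-memory backend, which plays the role
# of a remote bucket in this sketch.
with fsspec.open("memory://demo/data.csv", mode="w") as handle:
    handle.write("name,value\nalpha,1\nbeta,2\n")

rows = read_rows("memory://demo/data.csv")
print(rows)
```

Under the assumption that the matching backend package is installed (s3fs for S3, gcsfs for Google Cloud Storage), the same helper should accept an `s3://` or `gcs://` URL or a plain local path, with credentials and similar settings passed through as storage options.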

14:33 - Yeah, yeah.

14:34 So instead of passing the path directly into the read function, you pass in the file object.

14:40 And it's really powerful.

14:43 Like it seems like a thing that we shouldn't need, but files get, like the file locations can get so crazy so quickly.

14:51 And this just really helps simplify and like make it so you don't have to think about this stuff which I think is what most people want.

14:57 It's what I want.

14:59 - Yeah, for sure.

15:00 So like there's a local file system option, but then you could also have an FTP file system or you could have something else, right?

15:07 All sorts of different options.

15:08 - Yeah, yeah, all sorts of stuff.

15:10 - Yeah, okay, that's cool.

15:12 Brian, what do you think?

15:13 Does this have any applicability for you?

15:15 - Oh yeah, definitely.

15:16 And that's a great abstraction layer to put in place to just have reading as if it was a file and have it moved.

15:25 It also helps you develop tools locally and then be able to deploy them into a larger space.

15:30 So it's cool.

15:31 - Yeah, for sure.

15:32 One of the things that always makes me a little hesitant when I hear people say things like, "We're cloud native." Like my app is cloud native.

15:40 That's always code word for me,

15:41 like, I will never be able to run my app unless I'm connected to the internet.

15:45 You know, it's like, it depends on all these services together and there's no way I can recreate that locally.

15:50 But something like this could allow you to say, well, we're gonna have a local file system version, but then when we go to production, we'll switch to S3 or, you know, pick something.

15:58 - I've always wanted to make it either a t-shirt or a sticker or both that says, not a cloud native, just visiting.

16:04 - Nice, I also think Brian, there might be testing opportunities here.

16:09 - Yeah, definitely.

16:10 Give it a test file system. That'd be cool.

16:13 Yeah, and like Julia said, swapping things out to just have your logic not have to care where it's coming from.

16:20 But I guess it would make sure, you'd have to make sure all of the interfaces, the different storage systems really are equal.

16:29 But I guess you'd try that out yourself.

16:32 Yeah, there's like kind of a bucket, right?

16:34 That there's kind of like a dict that you can pass, which is like storage options.

16:38 So I think that might get a little wonky depending on what the different backends need.

16:44 But the general principles are the same.

16:47 And it also, I should have said this originally, but fsspec itself can contain logic to do things that are general to all the different libraries, like caching and things like that.

16:58 To all the different--

16:59 - Oh, well, interesting.

17:00 Like you could put a caching layer on top of arbitrary things like S3, Google Storage, and Azure Buckets or Blob Storage.

17:07 - Yeah, yeah, maybe even save money on bandwidth there if you can do some caching.

17:12 - Yeah, if you can do it right.

17:14 - Yeah, super, super neat.

17:15 Brian, you're gonna tell us about how to slim down our Docker containers, but before you do, I wanna tell people about our sponsor for this episode, brought to you by Sentry.

17:23 So how would you like to remove a little stress from your life in addition to just abstracting your file system, maybe tracking down some errors?

17:30 So do you worry that your users may be having difficulties or encountering errors with your app right now?

17:35 And would you even know it until they send that support email?

17:38 How much better would it be if you got the error or performance details sent right away, with all the call stack, maybe with git blame in there,

17:46 The local variables, the active user who was logged in while this happened, all that kind of stuff.

17:51 So with Sentry, it's not only possible, it's actually really simple.

17:55 I've used Sentry on our websites before, so it's on Python Bytes, Talk Python Training, all those different sites.

18:02 And I've actually had someone encounter an error trying to buy a course over on Talk Python Training.

18:07 I got the Sentry notification.

18:09 I said, "Oh, geez, I can't believe this problem crept in here," and I fixed it really quick and started to roll out the fix, and actually got an email.

18:16 They said, "Hey, we're having this problem buying a course." I said, "I know, I've almost got it fixed.

18:20 Just give me a moment and try again." And they were just like, "What?

18:23 That doesn't make sense." So they were very surprised.

18:26 So surprise your users; create your Sentry account at pythonbytes.fm/sentry.

18:30 And when you sign up, there's a little "Got a promo code?"

18:33 Make sure that you put PythonBytes, all one word, all caps, with a Y in there, and you'll get two free months plus a bunch of extra features and so on.

18:41 So also, it really lets them know that you came from us rather than just somewhere else, and that helps support the show a lot.

18:47 So, pythonbytes.fm/sentry and promo code PythonBytes.

18:50 Awesome, thanks for supporting the show, Sentry.

18:53 And Brian, let's talk Docker.

18:56 - Yeah, let's talk Docker.

18:57 I mean, I'm starting to use Docker more and more, and I like the experience, But I was interested when this article came up.

19:06 So it was in June, I saw this article called the Need for Slimmer Containers.

19:11 And this is from somebody Ivan, I'm not gonna try his last name, Ivan something.

19:18 But anyway, it's an interesting discussion.

19:20 And the idea around the original post was that there's now a Docker scan that you can use.

19:29 So you can use Docker scan to scan for vulnerabilities in your Docker containers.

19:35 And Ivan thought, well, I'll look at some of the standard Python containers that are available.

19:40 - Right, theoretically, some of the things that are nice is I can just go and, in my Dockerfile, I can say FROM python:3.9.

19:49 And I don't have to think about how do I install Python?

19:51 How do I keep it up to date?

19:53 You know, make sure that pip is there and that I'll be able to, you know, pip install stuff that needs to do build things,

19:58 that all that stuff will be there, right?

19:59 So it seems like, of course, this is what you want.

20:02 >> Yeah. Well, and also that's one of the neat things about Docker.

20:06 I can just say, I have these standard parts, now I just want to put my custom stuff on top of it.

20:12 It's great. Well, what did he find?

20:16 Docker scan apparently uses a third-party tool called Snyk, S-N-Y-K, Container.

20:24 We've covered Snyk before, not the container version, but we covered Snyk in episode 227.

20:31 It's looking for vulnerabilities and that's a good thing, but he found them in everything.

20:37 He found them in all of the standard Python ones, except for Alpine, I guess.

20:44 He didn't really know what to make of it really, he was just reporting his results that maybe Alpine is the only one with few vulnerabilities.

20:53 But then this went out on Hacker News and there was a big discussion around it.

20:59 So he updated the article, which I appreciate with some of the feedback that he got.

21:06 Some of the feedback was that these vulnerability checkers sometimes give you false positives.

21:12 I don't really have enough experience to know what that, well, I know what that means, but I don't have enough experience to know if these really are false positives or if they're actual vulnerabilities or not.

21:24 The other thing that some people suggested is that these standard ones really aren't updated very much.

21:33 I don't really know much about that either.

21:35 If they're not, that's a bummer because I think people are relying on them.

21:40 I actually just am left with a little bit of a confusion as to what to do.

21:46 I want to also mention that with Alpine, in the original article, he says Alpine is pretty good for vulnerabilities.

21:54 But then his follow-up says, it doesn't, there's a lot of applications that can't run on Alpine because of some issues or another.

22:01 So anyway, I'm not sure what to make of it.

22:03 So I was hoping Michael might give us insight.

22:06 >> I did some thinking about this this morning.

22:09 In fact, I recently spoke a lot about this over on Talk Python.

22:14 So I had Itamar Turner-Trauring on the show and we talked about best practices for Docker packaging, and we talked a lot about both security and package size.

22:24 So I can try to relay a couple of things from that.

22:28 So we've got our official image over here, our Python official image.

22:32 There's actually a bunch of options.

22:34 As you can see, there's a few, like 3.10 beta 2 buster or the 3.10 RC buster.

22:42 That sounds bad, but I think it's actually good.

22:44 No, I'm just kidding. I know what it is.

22:45 So these are by default based on Debian and Buster is the latest version of Debian.

22:52 And so you can do a Buster which is like full Debian with 3.10 or you can do a 3.10 slim Buster which is like a slimmed down version of Debian Buster that supports Python 3.10.

23:02 Okay, so there's a lot going on here in terms of the options.

23:07 One of...

23:08 So the article talks about how Alpine had the fewest security vulnerabilities.

23:14 And actually, so the python:latest, if you run the Snyk package scanner thingy on it, it says there's 364 vulnerabilities.

23:24 That's if you just do python:latest, 3.9, and 353 after you run apt update, apt upgrade.

23:32 So if you try to get the container to update itself, there's still 353 in that one.

23:38 I don't use that, I use Ubuntu.

23:40 So I use the Ubuntu latest.

23:41 And the bare version of that one had 31 vulnerabilities.

23:46 But then if I either install Python through apt or build it from source and put in the necessary foundational bits, like build-essential and stuff to build Python, it goes up to 35 total problems, where 28 of them are low.

23:59 So seven are medium, nothing major.

24:01 One thing I thought was weird was I actually ran another step where I said, okay, let's uninstall those intermediate tools like GCC and wget and stuff like that, that I needed to get stuff on the machine but I'm not going to use again.

24:13 And I took them away, and almost all those warnings were about those tools that I had apt uninstalled.

24:19 So I don't know why Snyk is still showing them, because if I go into the container and I type wget, it says, nope, this thing is not installed.

24:26 Sorry, but it still says the warning is that wget has a vulnerability in it, for example, right?

24:31 So there's like, there's like this over reporting for sure.

24:34 But I mean, the difference between 28 and 350 is not trivial.

24:38 Right, right.

24:39 So, like, run an apt install python3 type of thing; you know, it's probably worth it.

24:44 For example, when I switched from python:3.9 to python:3.9-slim-buster, it went from 350 to 69.

24:54 So that's a lot better.

24:56 Right.

24:56 Yeah.

24:56 It's still not as good as Ubuntu, but it's a lot better.

25:00 It's still twice as many.

25:02 I mean, you can't, it sounds better, but it could be like 359 low problems and then 69 critical ones.

25:09 It totally could. It totally could.

25:11 Yeah, also, if the reporting, like, if we can't trust Snyk necessarily, you know, if you can't trust your reporting system, then like maybe none of this means anything, right?

25:25 Yeah. Yeah, I think one of the things the article originally started out to address was if you have fewer subsystems, there's no chance the missing subsystem could get hacked because it's not there.

25:36 Right. So if there's a vulnerability in SSH, but you literally don't install SSH, who cares? Whereas if you just take the full distribution, you may potentially get affected by something you dragged along. And then it went down this rabbit hole of like, well, let me scan it and so on. So I want to add one more thing: Alpine did result in the best outcome from the scanner, but there's a lot of issues with Alpine and Python.

26:01 So for example, there's this PEP here, 656. Right now, if I try to pip install something on Alpine, especially in the data science world where things are large and compiling takes a lot of steps and so on, the wheels that are built for Linux are built for, what is it, glib, glibc, I mean, hold on, I'll look over here, I wrote it down so I know. No, I didn't write it down, sorry.

26:27 I think it's glibc, which is the C runtime on, like, Ubuntu and Debian, but there's one, M-U-S-L, musl, on Alpine, and the wheels

26:37 are not built for musl.

26:39 They're built for glibc.

26:41 So you can't pip install that; you've got to download everything and then compile it.

26:46 And compiling, like, matplotlib and Jupyter from scratch can take a really long time versus just downloading the wheel,

26:53 and it takes up a lot of space, and there's a bunch of issues and things around that that can make it slightly not Python friendly.

27:00 That's why there's this PEP 656, to allow wheels to be tagged as supporting musl, not glibc.
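
To make the tagging idea concrete, here is a small sketch that reads the platform tag off a wheel file name and reports which C library it targets. The `wheel_libc` helper and the example file names are hypothetical; the only facts relied on are that the platform tag is the last dash-separated part of a wheel name, that glibc wheels use `manylinux` tags, and that PEP 656 introduces `musllinux` tags.

```python
def wheel_libc(filename):
    """Return the C libraries a wheel's platform tags target, e.g. ['glibc']."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")

    # Wheel names look like: name-version-python-abi-platform.whl
    platform_tag = filename[: -len(".whl")].split("-")[-1]

    libcs = set()
    # Several platform tags can be packed together, separated by dots.
    for tag in platform_tag.split("."):
        if tag.startswith("manylinux"):
            libcs.add("glibc")
        elif tag.startswith("musllinux"):
            libcs.add("musl")
    return sorted(libcs)


# Hypothetical file names, for illustration only.
print(wheel_libc("example-1.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"))
print(wheel_libc("example-1.0-cp39-cp39-musllinux_1_1_x86_64.whl"))
print(wheel_libc("example-1.0-py3-none-any.whl"))
```

A pure-Python wheel tagged `any` installs on both, which is why the pain on Alpine shows up mostly with compiled packages.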

27:09 Is that more than you wanted Brian?

27:10 Are you good?

27:11 Okay, so the takeaway that I'm getting is probably not panic on some of these, but maybe at least pay attention to them.

27:19 And it is good, like you said, to remove tools out of your Docker images that you're not using.

27:27 If you're not using Wget in your application, take it off, things like that.

27:31 - Yeah, exactly.

27:32 I think Julia's point was great, right?

27:33 It's, if you, it might be a false positive, but at the same time, if you're not gonna use it again, because Docker, a lot of times, you pip install all your stuff, and then it's kind of ready to run, but you're not gonna go and pip install something again.

27:46 you're gonna do a new Docker build from scratch, right?

27:49 Like one of the final lines could be remove all those intermediate things that could have problems and make it larger and whatnot.

27:56 - Yeah, I thought, so I've only thought about this from like package, from like image size, right?

28:02 Like that you want slimmer images just because it takes forever to get them around.

28:07 But it's interesting to think about from a vulnerability perspective.

28:10 And I've always seen it done as you do whatever installation you need and then you do all these like cleaning steps.

28:17 But what you said, Michael, about like not ever putting certain things on your image is interesting.

28:23 I haven't heard of that before.

28:25 - Yeah, thanks.

28:26 I also had Peter McKee, who works at Docker, on Talk Python a little while ago, like six months ago or something, and he talked about having these multi-step builds, something to that effect. It doesn't make as much sense with Python.

28:37 I'll try to put it together.

28:38 But like imagine you're building a Go library.

28:39 You could put the Go runtime and build tools on a container, build your thing, but the thing you get from Go is an actual binary that's all self-contained.

28:48 You could throw that container away and just copy the output of that into your actual container and never even put all those tools on the actual system that goes to production.

28:57 With Python, that might look something like maybe using PEX to package up all the stuff inside of a virtual environment.

29:04 And as long as Python, the runtime, is there, then you can, like, run the PEX on your other machine.

29:08 But you could potentially not even ever install those, which might be good.

29:12 - Yeah, that makes sense.

29:12 - Yeah, there's a lot there that is sort of beyond my comfort level, but that's what I thought as I looked at this, Brian.

29:20 - Well, thanks for taking a look.

29:21 - Sure, you bet.

29:22 All right, we like to talk about GUIs on the show every now and then.

29:26 And so, and we wanna talk about pandas and data frames and data science and all that.

29:32 So let's put those together.

29:33 There's this project over here called PandasGUI.

29:37 And the documentation is sparse, let's say.

29:41 It's pretty easy.

29:42 There's an example or two.

29:43 So I could come down here and do my pandas stuff and create a data frame, and then I could just import show from pandasgui, and within my notebook it will pop open a separate window that then allows me to cruise around and check it out.
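
A minimal sketch of the workflow described here, assuming the `pandasgui` package is installed and exposes `show` as mentioned in the episode. The sample data, the column names, and the `explore` helper are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the columns are illustrative only.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "speed": [45, 60, 80, 50],
        "color": ["red", "blue", "red", "green"],
    }
)


def explore(frame: pd.DataFrame) -> None:
    """Open the frame in PandasGUI (imported lazily because it needs a desktop GUI)."""
    from pandasgui import show  # assumption: pandasgui is installed

    show(frame)


# Uncomment in a notebook or script on a machine with a display:
# explore(df)
```

The import sits inside the helper so the rest of the script still runs on a headless machine.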

29:56 So it does, you know, you can print out the data frame in a notebook and you get kind of a static Excel grid looking thing and that's nice.

30:06 But with this, you get an interactive one that lets you sort and select.

30:10 You can actually copy and paste chunks out of there as if it was Excel and then paste it in other places.

30:15 It also has a plotting library with like pictures.

30:18 So I'm gonna go click on the bar graph picture and then there's a list of all the columns and the things that the bar graph needs and you can drag and drop.

30:25 This column is the X axis and this column is the Y axis, and I wanna group by color, you know, color it by some other aspect of the data, and then, like, group into multiple charts or multiple lines or plots on a chart, all sorts of cool stuff like that.

30:42 There's a statistics section, you can import and export, I guess import CSV files with drag and drop, and there's also search that you can do.

30:51 So it's a pretty neat, quick way to explore pandas.

30:55 - Yeah, it's a neat idea.

30:57 Like when you first encounter a data frame, like you really want to just be able to like look at it without any assumptions.

31:06 And there's a lot of stuff that kind of goes towards that, with, like, the .plot API in pandas making it really accessible to make plots really quickly.

31:15 But this is like kind of like that step beyond that, right?

31:18 Of just visualizing it immediately.

31:21 - Yeah, like one thing you get when you view the data frame is, you know, like I said, it looks kind of just like printing df or just typing df in the notebook.

31:30 But then on the right, you can say, oh, I want to see the filters and you can type in these filter expressions, these query expressions and then turn them all, like pile them on, you can have little check boxes to like optionally turn them off and not delete them.

31:42 And then of course you can sort within there like that.

31:45 And the graphing, I think the support for the graphing part is really, really helpful.

31:49 So the fact that you can just go and click and say, oh, I want a box plot, and then the box plot needs these things, you can just drag and drop the columns from your data frame definition over and it just live updates.
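
The stackable filter expressions described above can be approximated in plain pandas with `DataFrame.query`. This is a sketch under the assumption that the GUI's filter strings follow the same query syntax; the data and the filter list are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "speed": [45, 60, 80, 50],
        "color": ["red", "blue", "red", "green"],
    }
)

# Each entry is (expression, enabled). Keeping disabled filters in the list
# mirrors the checkboxes that switch a filter off without deleting it.
filters = [
    ("speed > 48", True),
    ("color == 'red'", True),
    ("speed < 70", False),  # kept around, but switched off
]

filtered = df
for expression, enabled in filters:
    if enabled:
        filtered = filtered.query(expression)

print(filtered)
```

Because the filters live in code, the same selection can be rerun later, which is the reproducibility that a click-only workflow lacks.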

32:02 - Yeah, I think that really lets people visualize the data in the way that they want to sometimes, rather than the way they already know how in matplotlib, which I think is what people end up doing, at least for exploratory stuff.

32:15 - Yeah, exactly.

32:16 You could real quickly switch between a bar, a box, a scatter plot, back and forth, without having to actually be familiar with how those work.

32:23 - Can you tell if there's a way to export the filters, or is there any mechanism for that?

32:30 I don't think so, at least in the YouTube explainer video, there were some comments like, you know what would be awesome?

32:37 Export this as code from here so that I can just turn it back into Python.

32:41 I didn't see anything like that, but--

32:44 - Yeah, sometimes GUIs are a little weird for me because of that, you know, like you end up in this GUI world and it's not, you can't reproduce anything.

32:51 - I clicked on a whole bunch of stuff and then it looked great, but don't touch it.

32:57 - Yeah, exactly.

32:57 - I can't do it again.

32:59 >> Okay. But to be fair, it is a fairly quick way to look at the data and know what you, maybe you can't produce that exact plot again, but you know what the data looks like and you can use a different plotting mechanism to do that.

33:13 >> Yeah. The visual is pretty clear.

33:15 Okay. Well, x is assigned to speed and we know it's a histogram.

33:19 You could pretty quickly, with some Googling and Stack Overflow, go, "All right, how do I do a matplotlib histogram and get that going?" >> That's a huge time saver.

33:28 >> Yeah, but some export of like, okay, give me the code to make this plot in my own code.

33:34 That would be great.
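
As a rough idea of what such exported code might look like for the case mentioned above (x assigned to speed, chart type histogram), here is a sketch using the pandas `.plot` API with matplotlib. The data is invented and the bin count is an arbitrary choice.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen so no window is needed

import pandas as pd

# Hypothetical data standing in for the data frame explored in the GUI.
df = pd.DataFrame({"speed": [45, 60, 80, 50, 62, 58, 75, 49]})

# x is the "speed" column and the chart type is a histogram.
ax = df["speed"].plot.hist(bins=4, title="Speed")
ax.set_xlabel("speed")

# Write the image to an in-memory buffer; pass a filename instead to save it.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
```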

33:35 >> Yeah, absolutely.

33:37 On to the next, but before we get there, I do want to call out just a shout out by PyLang that fsspec is sweet.

33:45 Good mention. Yeah, I like it as well.

33:47 >> Cool.

33:48 >> All right. Xarray.

33:50 >> Xarray.

33:51 - Okay, so Xarray is my favorite library.

33:56 It's like a pandas, so it's a pandas-like API, but it's for n-dimensional data.

34:04 So if you have, a lot of times people talk about geospatial data where there's lat, lon, time, and others, but also image data where there's maybe a bunch of different bands from satellite imagery, or other disciplines where you just have labeled data that's not tabular, so the axes mean something, but there's not just one or two of them, then Xarray is great for that, 'cause it lets you do things like select a certain subset of time, or a certain subset of whatever your dimension is, and you can also aggregate across different dimensions, and you can use the labels directly.

34:45 So if you don't have a tool like this, I see people doing this a lot with like machine learning workflows where they'll be, they'll have like separate, like a list of all their, they'll have like a list of all their labels and then they'll have their data and they'll do some manipulation and they'll try to like reattach them at the end.

35:05 - Oh no.

35:05 - And it just, it just turns into a mess.

35:09 And Xarray actually just, like, takes care of that all for you.

35:13 It's pretty great.

35:14 And I think that it has applications that have not been fully realized yet.

35:18 And it's starting to take off in other spaces, but it really comes from this geospatial world.

35:23 But I think it could be useful for all sorts of people.
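
A small sketch of the label-based selection and aggregation described above, assuming `xarray`, `numpy`, and `pandas` are installed. The dimension names, coordinates, and values are all made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2021-06-01", periods=4, freq="D")
lats = [10.0, 20.0]
lons = [100.0, 110.0, 120.0]

# Made-up values with shape (time, lat, lon) = (4, 2, 3).
values = np.arange(24, dtype=float).reshape(4, 2, 3)

temperature = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={"time": times, "lat": lats, "lon": lons},
    name="temperature",
)

# Select by label rather than by integer position.
first_two_days = temperature.sel(time=slice("2021-06-01", "2021-06-02"))

# Aggregate across a named dimension; the lat/lon labels stay attached.
time_mean = temperature.mean(dim="time")

print(first_two_days.shape)
print(time_mean.sel(lat=20.0, lon=110.0).item())
```

The labels travel with the data through both operations, so there is no separate list of labels to reattach afterwards.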

35:25 - Right, because in geospatial, sometimes you have three dimensions, not just two.

35:30 - Yeah, you almost always have three, right?

35:33 - Sorry, Brian.

35:33 - No, the documentation looks great too.

35:35 The documentation has like getting started guides and tutorials and videos and galleries and stuff.

35:42 so definitely check out the documentation.

35:44 - Yeah, I think it got a major, it seems like, I looked at it for this too, and it seems like it got a major facelift, so it looks really nice.

35:52 It also supports the .plot API, or some different version of it that's like the pandas version, but you can plot in three different dimensions, or aggregate and then plot, and so that's a really nice way to get the visuals quickly.

36:11 And then the last thing that I wanted to say about it is that it's normally backed by NumPy arrays, but it can also be backed by Dask arrays or Sparse arrays or all sorts of different arrays natively.

36:23 So it's really cool, it's another one of these building block things where you can have Xarray's, like, labeling and indexing and all the nice stuff, and then down inside it can be NumPy or CuPy or Dask.

36:38 - How interesting.

36:39 - So it can do that juggling and piecing back together that other people are manually doing, and you just have this simple API, and if it has to do that, it'll figure it out.

36:48 - Yeah, yeah, that's pretty cool.

36:49 - Nice, and you talked about CuPy and Dask, like those are some pretty interesting back ends for this.

36:56 - Yeah, yeah, the Dask one is, I said CuPy, and now I'm wondering if maybe it's just, like, Dask and then CuPy, so don't quote me on that right away. But yeah, the Dask one is really integrated with Xarray code.

37:11 So they just do some special things to make it so that it works with parallelizing and things.

37:16 But from the user experience, it's the same.

37:19 - Yeah, fantastic.

37:20 And then I also noticed it requires Python 3.7.

37:23 Really nice to see tools sort of keeping up with the latest, not really old stuff.

37:28 - Well, hopefully it's 3.7 and above.

37:30 - Well, yeah, greater than or equal to.

37:33 - Well, I mean, I ran into a library.

37:35 It was an internal thing that was only 3.7.

37:39 So I tried it on, I'm like, I assumed or above and I tried it on 3.9 and it fell over.

37:44 What's going on? It was only 3.7. It's weird.

37:48 >> That is weird.

37:50 >> That'd be interesting to think about what special features of 3.7 they're depending on that broke in 3.8.

37:56 >> Yeah, that's what I was thinking. How do you do that?

37:58 Without just checking for equal 3.7 on version.

38:01 >> Yeah.

38:02 >> So anyway.
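
We don't know what that internal library actually did, but one way to end up "3.7 only" is an exact version comparison (or, on the packaging side, a pin such as `python_requires="==3.7.*"` instead of `">=3.7"`). The sketch below contrasts the two kinds of check; the function names are made up for illustration.

```python
import sys


def is_exactly(version_info, major, minor):
    """Strict check: passes only on one specific minor version."""
    return tuple(version_info[:2]) == (major, minor)


def is_at_least(version_info, major, minor):
    """Minimum check: passes on that version and anything newer."""
    return tuple(version_info[:2]) >= (major, minor)


# A "3.7 only" gate rejects 3.9, while a ">= 3.7" gate accepts it.
print(is_exactly((3, 9, 0), 3, 7))
print(is_at_least((3, 9, 0), 3, 7))

# The same check against the interpreter running this code.
print(is_at_least(sys.version_info, 3, 7))
```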

38:03 >> Yeah. All right. Well, that's it for our six main topics.

38:06 Brian, you got anything else you want to throw out there quickly?

38:08 >> Yeah, actually. I didn't have this up, but on Twitter, somebody reacted to me with an emoji, and I didn't know what they meant.

38:23 Let me pop this up.

38:26 It's Emojipedia, and it was helpful.

38:32 and you can just copy and paste the emoji that somebody uses in there, and it tells you what it means.

38:38 And the, you know, kind of not just what it's supposed to mean but also what people are using it for.

38:44 I don't know, for somebody that's sort of an old guy that is out of touch sometimes, this was helpful.

38:49 Anyway.

38:51 - Yeah, I mean, sometimes it's obvious.

38:53 Like a heart, we know what a heart means, right?

38:55 But, you know, like, hands together, it's not necessarily that that's like a thank you sort of bow type of thing.

39:01 >> There's certain ones where you're like, "What does that mean?" >> It was like a hands together with arrows coming out of the top.

39:06 I'm like, "I don't know what this is." But apparently, it's just raising hands like you're saying hooray for somebody.

39:13 Okay, that's nice. It's good.

39:15 >> I use Emojipedia all the time, but I think I use it in the opposite way.

39:18 I use it to get an emoji to put somewhere because I don't have an emoji keyboard or whatever.

39:24 >> Yeah, that would be good too.

39:26 The other thing I wanted to bring up is I hopefully have some cool news to share tomorrow about the Pytest book and the news will show up on a revamped Pytest book site.

39:38 So if you go to pytestbook.com, you get redirected to this pythontest.com page where I'll talk about the second edition.

39:48 So hopefully there'll be news about the second edition coming out tomorrow.

39:52 And I...

39:53 Is there any static site magic?

39:55 Yeah, yeah.

39:56 Static site.

39:57 And I totally...

39:58 And it goes dark and light.

39:59 But I totally stole from Prajan.

40:01 So Prajan has the same, he's got a really nice site.

40:06 So it's a bunch of great.

40:08 It looked great. I'm like, "That'll work.

40:10 I'll just do what he's doing." So that's what I did.

40:12 >> Yeah. Very cool.

40:14 >> I think we have exactly the same stack for our Saturn Cloud site now.

40:18 >> Oh, how neat.

40:18 >> That's cool.

40:19 >> Awesome. How about you, Julia?

40:21 Anything else you want to give a shout out to?

40:23 >> Well, I've been really into entry points recently.

40:26 Just like the concept of them is very cool.

40:29 - As in like Python packages, you can give them almost like CLI command type of entry points?

40:34 - Yeah, but the thing that I think is really cool is, like, the example that made me first realize about entry points is that pandas has this .plot API, which uses matplotlib.

40:44 I think I mentioned this three times now.

40:46 But you can swap out the back end, so you don't have to have matplotlib.

40:49 You can use other back ends.

40:50 And all the logic for that is in the other visualization libraries themselves, not in pandas.

40:58 So it's just like, you can swap out other things.

41:02 It's not just for CLIs.
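
To make the plugin idea concrete, here is a sketch that lists the entry points installed packages have registered under a given group, using the standard library's `importlib.metadata`. The group name `pandas_plotting_backends` is, to our understanding, the one pandas looks up for alternative plotting backends; treat that as an assumption. The helper name is made up.

```python
import sys
from importlib.metadata import entry_points


def plugins_for(group):
    """Return the entry points registered under `group` by installed packages."""
    if sys.version_info >= (3, 10):
        return list(entry_points(group=group))
    # Python 3.8 and 3.9 return a dict keyed by group name.
    return list(entry_points().get(group, []))


# "console_scripts" is the group behind command-line entry points;
# the plotting group shows the same mechanism used for plugins.
for group in ("pandas_plotting_backends", "console_scripts"):
    names = sorted(ep.name for ep in plugins_for(group))
    print(group, names[:5])
```

The host library only needs to know the group name; each plugin package declares its own entry point, which is why the backend logic can live outside pandas.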

41:04 - Okay, yeah, how neat.

41:06 All right, yeah, I learned about entry points a year, year and a half ago, and ever since I'm, oh yeah, this is awesome.

41:11 I can now create these little commands that'll be part of just my shell.

41:14 I love it.

41:15 - Yeah, the other thing I wanted to say was the GitHub CLI is really cool.

41:18 I think that's standalone, but it's, I've been using it a lot.

41:22 - I'm sure people know the Git CLI, What's the story of the GitHub CLI?

41:27 - Oh, well the GitHub CLI makes it, so if you have ever tried to check out a branch on someone else's fork, like if you want to evaluate a PR that someone has put on a fork, that is the situation where the GitHub CLI is really great, 'cause you can just do, like, gh checkout pr, or gh pr checkout, whatever the number is, and then you're just on their branch.

41:52 And if you have push access to their branch, if you're a maintainer and they've allowed it, you can just push directly and you don't, I mean, I was always looking up that sequence of commands before, like, I know people have git aliases and stuff, but yeah, I'd really recommend checking it out if you do a lot of GitHub stuff.

42:09 - Okay, awesome, yeah, that's great advice.

42:11 - Yeah, I often wanna like check out some, so pull requests, I wanna be able to like play with it and run their code and so, yeah.

42:18 - It's the best.

42:20 - Yeah, awesome.

42:21 All right, I got a couple things to add.

42:22 By the way, first of all, just that first practical SQL analysis that you talked about.

42:26 It also is a similar theme that you were talking about, Brian.

42:30 One of the things I thought was cool, though, as you scroll through it, it has a progress bar for reading at the top, and that just made me so happy.

42:35 I don't know why.

42:36 That was really neat.

42:38 All right, but I have a bunch of hear all about it sort of things.

42:40 So really quick, Python, B2, I just got the sense, yeah, okay.

42:45 Live update.

42:46 Python 3.10 Beta 2 is out if people want to check that out, and you can go download that.

42:52 It also highlights all the major features like the pipe operator for writing unions and type specifications and a bunch of other stuff that people might care about.

43:02 Structural pattern matching is probably a big one.

43:05 Yeah.

43:06 Go to the completely different.

43:07 Down.

43:07 Is that on here?

43:09 And now for something completely different.

43:10 I love that part.

43:11 So right above the files.

43:12 Yeah.

43:14 Oh, interesting.

43:16 The Ehrenfest paradox concerns the rotation of a rigid disk in the theory of relativity.

43:22 Its original 1909 formulation presented by, yeah.

43:25 Okay.

43:25 That is unexpected, but very cool.

43:27 And completely different and irrelevant.

43:29 Yeah.

43:30 Yeah.

43:30 Awesome.

43:31 Okay.

43:31 So takeaway: 3.10 beta 2 is out.

43:34 People can check that out.

43:35 There's also some security patches for Django.

43:37 So be sure to check that out.

43:38 One thing that surprised me is the Microsoft install of Python from the Windows Store already, like, has a 3.10 beta store install.

43:49 So, okay.

43:50 That's pretty cool that they're keeping that up to date.

43:52 >> It's rated E for everyone.

43:54 >> Yeah, even kids can pip install.

43:56 Awesome. Frederick Bankston sent a message in response to our last show where we talked about the method overloading by type.

44:05 If it takes an int or a string, it calls different functions.

44:08 He also pointed us towards this other library, multimethod, that is similar, so people can check that out. That's cool.

44:14 >> Yeah, neat.
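
For a feel of what overloading by argument type looks like, here is a sketch using the standard library's `functools.singledispatch`. This is a stand-in, not the library mentioned above, and the function name is made up.

```python
from functools import singledispatch


@singledispatch
def describe_id(value):
    """Fallback for types that have no registered implementation."""
    raise TypeError(f"unsupported type: {type(value).__name__}")


@describe_id.register
def _(value: int):
    # Chosen when the first argument is an int.
    return f"numeric id {value}"


@describe_id.register
def _(value: str):
    # Chosen when the first argument is a str.
    return f"text id {value!r}"


print(describe_id(7))
print(describe_id("abc"))
```

`singledispatch` only looks at the type of the first argument, which is one reason third-party libraries exist for dispatching on several arguments.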

44:15 >> Speaking of the GitHub stuff, I've been starting to use PyCharm 2021.2 early access version, early access program version one, and it's been working fine.

44:25 So if people want to try out the new features, there's a bunch of cool stuff.

44:28 You have support for Python 3.10 and new stuff for pytest.

44:32 I don't remember if this came in here, but one thing that I did learn about that recently that's in there that's super cool is they have in PyCharm, if you log in PyCharm into your GitHub account, there's a pull request section and you can just click it and it'll do those same steps that Julia was talking about.

44:50 Like right there in PyCharm, just go, I wanna try that PR before I accept it and just click that and go.

44:56 You can even have comments, you see the conversation inside there and everything is cool.

45:00 - Never go to the GitHub again.

45:02 - Exactly.

45:03 And just forget how to use it basically.

45:05 All right, that's it.

45:07 That's all the items I got.

45:08 So yeah, I've got other stuff that's just hanging around from before.

45:11 - Cool.

45:12 - All right, well, you wanna close it out with a joke?

45:14 - Yeah, always. - A couple of jokes?

45:16 Always, all right.

45:16 So over at upjoke.com/programmer-to-ask-jokes, you'll find many bad jokes.

45:22 Some even that are not very appropriate or whatever, but there's a few that are funny.

45:26 So I pulled out three here.

45:28 I'll do the first one.

45:30 Brian, you can do the second.

45:31 Julia, you can do the third, I guess, if you're up for it.

45:33 - Okay. - So this one, we should have saved for six months from now.

45:36 But I asked a programmer what her New Year's resolution would be.

45:39 She answered 1920 by 1080.

45:41 - That's so bad.

45:43 - No, that's awesome.

45:44 - It's really bad.

45:45 All right, well, you got to do the next one.

45:47 - How does a programmer confuse a mathematician?

45:52 - I don't know how.

45:53 - Just saying that X equals X plus one.

45:55 - All right, Julia.

46:00 - Okay, why do Python programmers have low self-esteem?

46:04 They're constantly comparing their self to other.

46:07 - Also bad.

46:11 Probably the worst, sorry we gave you that one.

46:13 - That's okay.

46:15 I saw the one that Brian did and I was like, oh, it should be X plus equals one.

46:19 And I was like, no, that ruins the joke.

46:21 (laughing)

46:22 - Exactly.

46:23 - Yeah.

46:24 Yeah, I actually often do the slow way or the non-obvious way.

46:30 - The proposed way, yeah.

46:31 - X equals X plus one, just to make it more obvious to people reading it sometimes.

46:36 - Yeah, yeah, no, I agree.

46:38 - Yeah, at least it's not C++ with X, plus plus X.

46:43 - I love that.

46:44 No, no, we should have that.

46:47 - I'm okay with X++, but not that also ++X.

46:50 - Oh, the pre-increment.

46:52 - Yeah, the pre-increment, the slight.

46:53 - Pre-increment's weird.

46:54 - Yes, exactly, exactly.

46:56 But I could go for it, X++, come on.

46:58 All right, well, Julia, thanks for joining us this week.

47:01 And Brian, thanks as always.

47:02 - Always a pleasure, thanks, Julia.

47:04 - Yeah, it's fun.

47:05 - Bye. - Bye.
