Transcript #319: CSS-Style Queries for... JSON?
00:00 Hello, and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is Episode 319, recorded January 17, 2023. And I'm Brian Okken.
00:11 And I'm Michael Kennedy.
00:12 Well, I'm super excited to talk about whatever you have to share with us. But before we get started, I just want to say thanks to Microsoft for Startups Founders Hub for sponsoring this episode. Listen to their spot later in the show. And let's see what you have to talk about, Michael.
00:27 What do I got to talk about? Also, I want to remind people they can go to pythonbytes.fm, click on the live stream, and see all the upcoming live streams and be part of that. It's always awesome to have them there. And follow us on Mastodon; we've got all of our things there. Believe it or not, we actually do a couple of things on Mastodon every now and then. But what I want to do is actually talk about this article that Ned Batchelder wrote, which I found on Mastodon when we did our tools for READMEs and other repo page types of things, called "Secure maintainer workflow". So Brian, we've got to judge the level of paranoia here. Do you worry about people getting into, like, pytest-check? Do you worry about people getting on your computer and accessing SSH keys or things like that?
01:11 I don't. I don't know if I should, but I don't.
01:15 Well, Ned does. And I share some of his concern. You know, on my hard drive, I have SSH keys. If you could figure out what computers those went to, you could remotely log into them. There are a few layers of indirection that make that more difficult than you would imagine, but it's still not that tricky. And there have been a bunch of issues. For example, let's see, there's the CircleCI breach; I believe that's probably a fair search term. CI is super scary, because they build the things that companies ship. So if you ship a website or a mobile app, or you ship a desktop app, or any of those types of things, it's potentially automated through CircleCI, and you send it out. So if somebody, say, were to take over your CircleCI, that would be bad. I believe what happened was somebody had compromised someone who works at CircleCI and gone into their GitHub account, right? That could be through an SSH key, or when you're in your terminal, you could just type git add, git push, all those types of things, right? So Ned says, well, what can I do so that if someone did get access to run code on my behalf, it maybe wouldn't be able to push directly to coverage.py and just start going out? And that's the next thing: once that goes out, that goes to everybody's servers, many of them anyway, right? And then you potentially have bad code running on people's servers. So the consequence is not just, oh, Ned might get hacked, but everyone using coverage.py, which is many, many, many people, might get hacked, right?
02:50 Yeah. And it's also used on developer workstations. So it's going on developers' computers as well.
02:55 Exactly. And then rinse and repeat, right? Now they have SSH keys to that; what are they building, and on and on. You know, it goes sideways fast. So he's like, well, I have terminal sessions that have implicit access to credentials: PyPI, Git, and so on, and it would be better if they didn't. You know, for example, you push to Git without being asked for a password, right, either through a credential cache or an SSH key or something like that. This is problematic in a couple of ways. The less likely, less concerning one, although a lot of advice sort of worries about this, and I agree that it's not very concerning at all, is somebody actually gets physical access to your computer. I don't know what most people do, but you should be turning on full disk encryption, especially if you have a laptop, right? If it could be stolen, or especially if you travel around with it and it could be lost somewhere, picked up, lifted off the subway or something, you don't want someone to be able to just take the disk out and read all the data off it, right? So a super easy way to do that with low overhead is FileVault, which is built into macOS, and I'm pretty sure Windows has something built in. So anyway, full disk encryption. So the chances that something bad happens there are really, really small. On the other hand, though, is if you run some evil code. Now, evil code could come from traditional places like spam or phishing or those other areas. But for developers, especially people maintaining popular projects like coverage.py and the many other things that Ned does, somebody could try to send him malicious code through Python and through source control. For example, what if somebody says, hey, Ned, I've got this issue with coverage.py. Check out this repo and run it to see the bug, to reproduce it. Who knows what that might do?
Well, whatever Ned can do on his computer is what it might do. And he says, look, if I get a huge repo, not a PR to coverage.py, but a huge set of code that coverage.py is applied to, you know, what is that potentially going to do? He can't go code review every huge PR that is sent to him when it refers to someone else's repo. All right, so there are things you can do, but that's his primary concern: how do you deal with people sending him bad code? So first thing is 1Password. 1Password is awesome. Also, not LastPass. Don't use LastPass. More on that at the end, but oh my god, don't use LastPass. 1Password or Bitwarden are really good choices. And he says, look, I store my credentials in there, and then you can have two shell functions that will load those variables into and out of the environment just for a moment. So load the GitHub credential into the environment, do a git push, unload it, for example, something like that, right? That's pretty cool. Similarly for things you need far less often, like PyPI credentials, right? How often do you really do a push, he says. But also I have a .ssh directory, which on Mac, and I think Linux as well, is where the default SSH keys just live unencrypted, hanging out there. So that would be something you want to keep away. Now he says, I don't know what to do with that. The comments here are very helpful. But the other thing is, if I've got to run that PR, and somebody gives me some huge bit of code, I'm running that in Docker. So get one of the base Docker images for Python, log into its interactive shell, git clone, try it out. So, you know, who cares if somebody hacks your Docker container? You're gonna throw it away anyway, right? So he asks, what else can I be doing to keep safe? And luckily, there are comments on his blog here. One says you could piggyback on the 1Password workflow to export extra SSH config. And going down here, Dirk Shawn says, I use Secretive, which keeps SSH keys on the Mac locked up. There are some comments on protecting Docker, although I don't really see any reason I would care about protecting a base Docker image. But crucial DOS, another core developer, says 1Password can do SSH. So 1Password will run an SSH agent that will serve up the keys on demand but prompt you for a fingerprint reading, or verifying with your watch, or entering your 1Password password, that type of thing, which is cool. He also suggests using Podman, which has higher security than Docker. Again, I'm not sure why you need that. But finally, Brett Cannon says, 1Password for SSH, let's go. That seems pretty awesome. So, interesting. Anyway, these are some ideas. I think it's only scratching the surface. But yeah. And then Christopher, just as a follow-up, says BitLocker is the FileVault equivalent for Windows. That's right. Thanks, Christopher.
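The "load a credential into the environment just for a moment" idea described above can be sketched in Python. This is only an illustration of the pattern, not Ned's actual shell functions: the helper name temporary_env, the variable name DEMO_PUSH_TOKEN, and its value are all made up, and a real setup would fetch the secret from a password manager rather than hard-coding it.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**secrets):
    """Expose the given values as environment variables only inside the with-block."""
    # Remember what was there before so the original state can be restored.
    previous = {name: os.environ.get(name) for name in secrets}
    os.environ.update(secrets)
    try:
        yield
    finally:
        for name, old_value in previous.items():
            if old_value is None:
                os.environ.pop(name, None)    # was not set before: remove it
            else:
                os.environ[name] = old_value  # was set before: put it back


# Hypothetical variable name and value, for illustration only.
with temporary_env(DEMO_PUSH_TOKEN="not-a-real-token"):
    # A child process started here (for example a git push run through
    # subprocess.run) would inherit DEMO_PUSH_TOKEN.
    seen_inside = os.environ.get("DEMO_PUSH_TOKEN")

# Outside the block the variable is gone again.
seen_after = os.environ.get("DEMO_PUSH_TOKEN")
```

The point of the design is that a terminal session, or any code it runs later, does not carry the credential around implicitly; it exists only for the duration of the one command that needs it.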
07:31 So one of the things, I mean, okay, so yes, protecting against, you know, losing your laptop or somebody taking it or reading your whatever, these are kind of cool. My concern isn't really that somebody's going to try to access it; it's that I can't anymore. Like, my laptop just dies, and I can't use it anymore. So things like 1Password, I assume they're backup-able so that I can get access to it.
07:57 Yeah, so 1Password stores all that information on their servers, where you control a super long encryption key that they don't have. So if you lose it, there's no "I'll get my thing back". Part of the setup process for 1Password is they're like, here's your 30-character secret key that is combined with your password. And if you don't have both of those, we can't help you. It's encrypted with this and we don't know what it is. So it's pretty good. It's not LastPass, again, which we'll touch on. Right? So it syncs to your phone, it syncs to your different computers. There's a web version. It works on Windows, Mac, or Linux. It's a pretty good option, honestly. It's paid, but it's not much, like five bucks a month. If you don't want that, there's Bitwarden, but Bitwarden is not quite as secure, because I don't think it has the secret key. It's just the password, so you need a longer password. And we won't go too far down that rabbit hole, maybe. But yeah, it's pretty interesting, and it certainly is a concern. So for example, you can have file attachments in your 1Password. So you can attach your SSH folder to, like, a logins item that you put in there. So if you go to a new computer, you can just, you know, open that thing up, get your SSH keys, drop them in there, and off you go. Great, but never, never lose that 30-character secret key, because you're not getting back in without it. Alright, over to you. What do you got?
09:20 What do I have? I've got some web scraping tools. I actually have a couple of tools for parsing HTML and parsing JSON that I thought were just pretty darn cool. So I was reading this article, which is a decent article, called "A year of writing about web scraping in review". So it's somebody that got a, you know, job doing a whole bunch of blog posts about web scraping. But one of the things when he talks about doing it in Python, he had HTTPX, and yeah, you and I both like that a lot; that's great stuff, pretty popular. But I hadn't heard of Parsel or JMESPath. JMESPath is J-M-E-S path. And so I wanted to check that out. These are some pretty cool tools. So what Parsel does is it's a Python library to extract and remove data from HTML and XML using XPath and CSS selectors. So the CSS part is the part that I'm excited about. But so the idea is, here's an example bit of HTML that we're showing on the live stream. And you could just access elements like you would with CSS, like, you know, h1::text. Not sure why it's colon colon instead of dot, but anyway.
10:38 I think those are, what are they called, like special classes in CSS? Okay. The text is that, but you can do things like h1:hover, and that only triggers when it hovers. But yeah, colon colon text? Right, okay, I get it. That is weird.
10:54 But anyway, kind of interesting. And then I'm used to the, like, the greater than; I think that's like some child thing or something.
11:04 Immediate child. Yeah, it has to be an immediate child. Yeah.
11:08 Okay. But it's fairly clear to read then, to be able to pull some stuff out of your HTML using these selectors. So that's pretty cool.
11:18 Yeah, that's really nice. I've always thought of Beautiful Soup for that. But that sounds really nice.
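To make the two selector ideas above concrete (text of an element, and the immediate-child ">" combinator), here is a small standard-library stand-in. It is not Parsel and supports nothing beyond a chain of tag names; the HTML sample and the helper names are invented for illustration. With Parsel itself, the equivalent queries would be written roughly as Selector(text=html).css("h1::text").get() and Selector(text=html).css("ul > li::text").getall().

```python
from html.parser import HTMLParser


class ChildTextCollector(HTMLParser):
    """Collect text nodes whose enclosing tags end with `chain`, e.g. ul > li."""

    def __init__(self, chain):
        super().__init__()
        self.chain = list(chain)
        self.stack = []   # currently open tags, outermost first
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, ignoring stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Match only when the innermost open tags are exactly the chain,
        # which mirrors an immediate-child relationship.
        if text and self.stack[-len(self.chain):] == self.chain:
            self.texts.append(text)


def select_text(html, *chain):
    collector = ChildTextCollector(chain)
    collector.feed(html)
    return collector.texts


# Invented sample document.
HTML = """
<html><body>
  <h1>Hello, Parsel!</h1>
  <ul>
    <li>first</li>
    <li>second</li>
  </ul>
</body></html>
"""

heading = select_text(HTML, "h1")        # like h1::text
items = select_text(HTML, "ul", "li")    # like ul > li::text
```

The stand-in only shows what the selectors mean; a real scraper would use Parsel (or Beautiful Soup) for full CSS and XPath support.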
11:23 The other one that I thought was great, and which I probably do more often, is grabbing stuff out of JSON. And so I hadn't heard of JMESPath. And it's just some pretty cool expressions to be able to pull out some stuff. So if you've got, like, this example of foo, and foo is a dictionary element, and it has another dictionary inside with bar and the value of baz, you can just say foo.bar, and it'll return that. So those are pretty cool. Just simple little tools for getting JSON data.
12:01 So that's interesting, because I never really thought of parsing JSON with, like, a search. Yeah, with a query, a CSS-like search. I've always just thought of it as, well, I'm just going to load it up and navigate it. But this is, I just want to go to this section and grab this array, and I don't care what's in the middle.
12:18 Yeah, well, and actually, so I need to play with it. You're right, I've never really thought too much about doing searches or something. I just, like, load it up and just navigate it. But if it's somewhere buried deep inside my document, I wouldn't know how to get it. So, yeah. Or possibly if it changes over time. So it's like, you know, there's a component on the page, but it might be loaded anywhere on the page.
12:44 Yeah, yeah, exactly. It's kind of like a CSS selector for JSON, which, that is a cool discovery.
12:50 So anyway, that's it. A couple of short items, but nice ones out there.
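The foo.bar example from the discussion can be mimicked with a few lines of plain Python. This is a stand-in for only the dotted-key part of the idea, not the JMESPath library; the function name search_dotted is made up. With the real library the same lookup would be jmespath.search("foo.bar", data), which also supports lists, filters, and projections that this sketch does not.

```python
def search_dotted(expression, data):
    """Follow a dotted key path such as "foo.bar" through nested dictionaries."""
    current = data
    for key in expression.split("."):
        # A missing key (or a non-dict in the middle) yields None,
        # similar to how a JMESPath query yields null.
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


document = {"foo": {"bar": "baz"}}

found = search_dotted("foo.bar", document)
missing = search_dotted("foo.nope", document)
```

The appeal over hand-written indexing like document["foo"]["bar"] is that the path is data: it can be stored, changed, or supplied by a user without touching the navigation code.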
12:53 Will McGugan says pseudo-classes. And yes, pseudo-classes, for sure. Absolutely. So that's like the :hover and stuff. These are all like :read-only, :valid, you know, :visited. But I don't know about the double colon. Maybe that's something else. Maybe it's just a specialization of pseudo-classes.
13:14 But I'll have to dig into it a little bit more.
13:16 Same. I've only been doing the web for like a few weeks. I went to this boot camp, I'm getting good at HTML. Tell us about our sponsor.
13:25 This episode of Python Bytes is brought to you by Microsoft for Startups. Microsoft for Startups has built Founders Hub to help startups be successful. Founders Hub provides founders at any stage with free resources to help solve startup challenges. The digital platform provides technology benefits, access to expert guidance, skilling resources, mentorship and networking connections, and so much more. Founders Hub is truly open to all, along with free access to GitHub and Microsoft Cloud, with the ability to unlock credits over time. Founders Hub has also partnered with other innovative companies to provide exclusive benefits and discounts. You will also have access to their mentorship network, giving you access to a pool of hundreds of mentors across a range of disciplines. You'll be able to book a one-on-one meeting with mentors, many of whom are former founders themselves. Make your idea a reality today with the critical support you'll get from Microsoft for Startups Founders Hub. To join the program, please visit pythonbytes.fm/foundershub2022. The link is in your show notes.
14:32 Indeed. Thank you, Microsoft, for sponsoring the show. Let's see, what do I got next? Back to Git. But this time, not protecting Git: understanding your Git repository. Brian, do you know what your largest Git repository is in size? No, I don't either. But I'm pretty sure that Talk Python Training, the website, is just under a gig. And that's quite a bit. So maybe, I haven't looked at all the others, but that one's one of the larger ones that I manage, and it's pretty big. But is it big because there's a bunch of binary stuff that I should maybe find and remove? Is it big because there's just a lot of files?
15:07 If you have directories named like backup one, backup two, storage...
15:11 No, no, no. Version one, version two, version one final, version final final. In fact, I zip those; I don't just have directories. Yeah. So anyway, what I want to tell you all about is a tool called git-sizer. It computes various size metrics for Git repositories, pointing out aspects of your repository that might cause problems. So if you've got a small repo, who cares, don't worry about this stuff. But on the other hand, if you've got one where it's like, I think it's a pain to check out, or CI builds are really dragging because of this, this segment, if not necessarily this tool, will be helpful for you. So I recently did an episode on monorepos with David Vujic, and we uncovered a bunch of cool tools. One of them is this git-sizer, because monorepos are like, I don't just have a repository for this project, I have a repository for the company, and all 100 people put all of their projects into that one repository, which is a bit of a mind bender. But if you do stuff like that, you need to think way more carefully about how you work with files and Git and so on. So you can ask questions like, is the repo too big overall? Ideally, it should be under one gig. Well, actually, maybe I'm over by like a couple of bytes, but whatever. And they start to get out of control at five gigs, like, Git doesn't behave well, that sort of thing. So you can do things like avoiding compiled output. So if you have JAR files, right, or wheels, I guess in our case; we have less compiled output. But if you have, say, wheels, and you want to keep a version of every release, maybe don't store that in Git; maybe store that somewhere else and link to it in Git. I don't know. You can also use Git Large File Storage instead of putting large files directly in there. And things that are not very compressible and cannot be diffed are very much hated by Git, because Git does a lot of its work by doing deltas, like, hey, here's the main one.
And then here's just the difference between these versions I need to keep. Like, if one line in this text file changed, that could be what's stored, right? But if it's just a binary thing that can't be diffed, then that's always a copy. Yeah. So you can go through here and download it to get started. You run it, and it'll tell you things like processing, you know, go through the blobs, basically analyzing the repository, and here's the overall repository size: here's how many commits there are, how big the commit history is, here's how many trees, as in folders and stuff, and the overall size of them, and the blobs, right? So this one has 55 gigs of blobs, 1.65 million blobs. That is a serious, serious bit of history there, whatever project they ran it on. But it will go through and tell you, you know, what's going on. And yeah, you can sort of look in and get a better understanding of what's happening in your Git repository.
17:54 Cool. Something else, and I'm sure everybody knows this already, but especially in CI and stuff, it's helpful when you're going to clone a branch to clone it to depth one, so that you're not cloning all the history when you don't need it.
18:10 Yeah, so I have some interesting, newer version of that guidance for you, Brian. Okay, so I was watching some of the presentations at GitHub Universe. And so what you're talking about is what's called a shallow clone. So it says we're only going to pretend that there's, say, three commits deep in this branch, in which case you only see three commits' worth of history, and so on, right? But it is much smaller, because it doesn't keep all those files over time. But you make the trade-off that you don't have all the history. If you wanted to go back and read that, potentially even check out back to one of those, you'd have to, like, delete the thing and check it out again somehow. Another, less shallow thing you can do that's real similar is a partial clone. I don't know how I ended up on this page, but there are docs for it on the GitHub documentation page. So with a shallow clone, all that it will check out is, like, the depth you started with. With a partial clone, all it will check out is the current version of the files, but all of the commit history and messages. So, like, this file has 10 changes, and here are the messages, but it doesn't check out those nine others. But what's cool is if you were to, like, switch a branch or go back in history, Git will on demand download that other one. So what you end up typing is something like git clone --filter=blob:none. Or you can even say, I want all the files except for anything over a certain size that might be in history, so anything over 100k, but other than that, give me every file forever throughout its whole history. So this partial clone is like a similar type of thing, but it's a little more flexible in that it's like a transparent proxy to the full history of the repo without cloning it, if that makes sense.
19:58 Yeah, so that might be a good workflow for, like, a development environment.
20:02 Yes, exactly. Yeah. If you're not hoping that you're going to be able to go offline and completely disassociate yourself from GitHub, right? If you're assuming, I still have online access, and I don't need this folder to be the true, complete, full history forever of the repo, I trust that it's in other places, then you'd be totally good. The one place where the shallow clone would be really awesome is for CI. Yeah, that's what I was mentioning. Yeah. If you're doing it for CI, then, like, your CI system does not care what the history is; it only cares what the current state is. Right? So shallow clone. And then related to that, you have sparse checkout. So you can say, I know there's a huge repo, but I just want these three directories and the stuff under them. And you can mix that with a partial clone. So you can combine these: a partial clone, but of just these three directories, even though there are thousands.
20:51 Oh, right. And, well, some companies do the whole, like, mega repo thing where it's the one.
20:56 Exactly. And that's where it would matter, because you're like, well, I don't want to check out seven terabytes, or whatever it turns out to be. So anyway, there's like a couple of interesting things. I present to you all git-sizer to give you a little bit of advice, but then also some of these other tools to help you deal with it more. If you are already in this realm, partial clones, shallow clones, and sparse checkouts all might be tools you can apply that are just built into Git and make this easier. Yeah.
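The three clone styles discussed above differ only in a few git flags, which can be laid out side by side. The sketch below just builds the argument lists (suitable for passing to subprocess.run) and does not run git; the helper names, the repository URL, and the directory names are placeholders, while the flags themselves (--depth, --filter=blob:none, --filter=blob:limit=100k, --sparse, and git sparse-checkout set) are standard git options.

```python
def clone_command(url, *, depth=None, blob_filter=None, sparse=False):
    """Build the argument list for a git clone using the options discussed."""
    command = ["git", "clone"]
    if depth is not None:
        command += ["--depth", str(depth)]          # shallow clone: truncated history
    if blob_filter is not None:
        command.append(f"--filter={blob_filter}")   # partial clone: blobs fetched on demand
    if sparse:
        command.append("--sparse")                  # start with only top-level files checked out
    command.append(url)
    return command


def sparse_checkout_command(*directories):
    """Build the follow-up command that picks which directories to materialize."""
    return ["git", "sparse-checkout", "set", *directories]


REPO = "https://example.com/org/big-repo.git"  # placeholder URL

shallow = clone_command(REPO, depth=1)                              # good fit for CI
partial = clone_command(REPO, blob_filter="blob:none")              # full history, lazy file contents
size_limited = clone_command(REPO, blob_filter="blob:limit=100k")   # skip blobs over 100k
partial_sparse = clone_command(REPO, blob_filter="blob:none", sparse=True)
pick_dirs = sparse_checkout_command("docs", "src/app")              # run inside the clone
```

As a rough guide following the conversation: a shallow clone suits CI, where history is irrelevant, while a partial clone (optionally sparse) suits a development machine that stays online and can fetch older file contents when needed.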
21:23 And also, LFS is not that hard to use, so if you really have to, use LFS.
21:28 Yeah, I have one other thing for you. I did this on Talk Python Training; I did a partial clone with --filter=blob:none, and without. So without, it was 71,000 deltas and 118,000 objects; with, 10,000 objects and 1,400 deltas. Much, much faster checkout. And like I said, it's kind of on demand; it'll go get the older files if it needs to. Cool. Anyway, it seems like it's pretty handy. Lots to Git. What are we getting to next?
21:59 Oh, right now we've got bad advice. So I guess this may be under the category of "do not try this at home" or "just don't listen to Brian". But it wasn't me; it was this other guy, Adrian. So this is a fun article called "Dataclasses without type annotations". So I'm using dataclasses a lot now, and I like attrs too. I like both attrs and dataclasses. But anyway. So apparently, and I didn't know this, dataclasses don't really care what the type is.
22:33 You can put a type, but it doesn't use the type.
22:36 It doesn't use the type at all, apparently. So you can do something like dot dot dot, for instance, as the type. And you can do some crazy things, so that doesn't even make any sense, but apparently works fine. And I'm like, I don't believe it. So I tried it, and, right, it doesn't use that. So there's a whole bunch of discussion around types here and type hints. And some people just kind of, they don't want to use types, and that's fine. But if you wanted to use dataclasses, they kind of require you to use types, but apparently you can get around it. And I just really wanted to show this horrible example of code. There's a dataclass that is called Literally, and it has a variable called anything, with the type being a tuple with two strings in it saying "can go", "in here". And we've got other variables with, like, lambda expressions as types. And also, I tried this: you have to put from __future__ import annotations in your file, but then you can put all sorts of horrible things in there. These symbols don't even have to be anywhere in your file. As long as it parses, it works fine.
23:55 For example, the first type is a tuple. It's not saying the type is tuple; it just is a tuple, like parenthesis, string, comma, string. Yeah. The second one is a lambda where the type value would go, and so on. Yeah.
24:11 Is that even a valid lambda? Well, I guess it is; you can have a lambda...
24:15 A parameterless lambda.
24:17 ...that only returns a string. How about "not" as an expression for a type: not even.evaluated is apparently a valid type. And then the last one is just awful: just.has, with to=be as the parameter to has, multiplied by syntactically[valid]. This is a nightmare, but it parses fine.
24:47 Crazy. Yeah, your editor might not like it. mypy might not like it.
24:51 Yeah. But there is some discussion of things that might be useful about this. Like, if you're really not using type annotations, but you want to use dataclasses, you perhaps want to put some strings in there as the type, to declare as a comment what the thing is, instead of, you know... I don't know. This is bad advice. Don't follow this, but it's fun.
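The claim that dataclasses only need an annotation to be present, and never check it, is easy to try. The sketch below reuses the Literally class name and the anything field from the discussion; the other two fields and all the values are made up. It deliberately sticks to annotations that are valid expressions on their own (a tuple, a lambda, the Ellipsis), so it runs without from __future__ import annotations; the wilder examples mentioned above, which refer to names that do not exist, would need that import so the annotations are never evaluated.

```python
from dataclasses import dataclass, fields


@dataclass
class Literally:
    anything: ("can go", "in here")            # a tuple of strings, not a type
    as_long_as: lambda: "it is an expression"  # a parameterless lambda (text is illustrative)
    nothing: ...                               # the Ellipsis object

# No validation happens: any values are accepted for any field.
record = Literally(1, "two", [3])

# The dataclass machinery still generated the fields, __init__, and __eq__.
field_names = [f.name for f in fields(Literally)]
```

As said in the episode, this is a curiosity rather than a recommendation: type checkers and editors will complain, and libraries such as Pydantic do rely on the annotations.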
25:17 It does break some conceptions that people might have about dataclasses, unlike, say, Pydantic, where this stuff matters. Dataclasses look like that, but apparently aren't validated.
25:28 Yeah, and apparently this was popular enough; this was written last year. And if you want to try to do something similar with dataclasses, similar to attrs, where you have, like, an attribute or something, apparently there's this other thing, a typeless dataclass, where you can just say it's a field and get around it. And this is available as a PyPI package.
25:56 Okay, cool. Typeless dataclasses, fun. Alright, well, is that all of our topics, Brian?
26:04 I think it is. I don't have any extras either. Do you have any extras for us?
26:09 I have two. Let's see, what have I got going on here? So in my notes in the show notes, my comment is: the LastPass story just keeps getting worse. What I have on the screen here, what I linked to, does not fully communicate the degree to which it has gotten worse. So keep that in mind with LastPass. It turns out that someone, guess what, broke into the GitHub repository of a developer sometime last year, like November. They then used that access to further their access, and eventually got the ability to copy every single customer's LastPass encrypted vault, which sounds terrifying. It shouldn't be, because that's theoretically encrypted with your big, long, not-reused password, but it is. That's a big if, but it should be, right? And then it probably should have some kind of secret key type of thing like 1Password. So even if my password was the letter A for 1Password, it's still 27 characters from the attacker's perspective. LastPass doesn't have that; it's just the letter A. So that's not ideal. There are some posts like, oh, well, don't worry, it's gonna take like 100 years to decrypt this. It would, if they had the latest settings, which are, like, if I just created a new account, it used like 100,000 iterations of folding the password, and other things about how long the password has to be. But here's the getting-worse part: the older versions didn't enforce that. And they didn't use password folding. Instead of using 100,000 or a million iterations, they used one. So instead of taking 100 years, or whatever it is, to decrypt it with a regular sort of cracking GPU system, it takes about 25 seconds to crack the password. And those passwords are versioned and changed in this way, depending on when you last used them over time.
So if you created a password 10 years ago, but then changed the settings, I don't think it goes back. Not 100% sure, but I think you can still have historical older passwords in your vault that are like that. On top of that, it turns out that things like the URL of where that password belongs and your email address were stored in plain text. Not with 1Password, but with LastPass, stored in plain text. Well, it's not plain text, it's Base64 encoded, but we all know what that means: just not readable by humans. But that's plain text. And so is when it was last accessed. So you can do things like, I want to go to the vault and see who has accessed some shady site, like who has accessed Tinder but also seems to be married. Can I blackmail that person without even figuring out what their password is? Just say, look, that's a little shady, I'm going to tell your wife about your Tinder account, you know. So, I mean, there are all sorts of really bad things. Plus, I can see that some of those passwords are going to a bank, with a password fold count of, you know, one, or 500, or 5,000, which I can just break straight away. So you can use the unencrypted bits to target which ones you want to go after. It's really bad. So, just a PSA: if you use LastPass, change your passwords. Yeah, period, because this is out there, and it's in plain text, except for the password. Not ideal. Okay, so anyways, I figured that was bad enough that I wanted to point it out, because I know a lot of people are like, don't worry, it's super encrypted. Like, sometimes.
29:26 Yeah, but who would actually do it for 25 seconds?
29:29 I know, come on. I mean, if it just said, like, bankofamerica.com/logon, I don't know, that's worth 20 seconds probably. I mean, banks are kind of a unique case, because they often have, like, 2FA, or, you know, what's your cat's favorite toy's name or whatever, right, that you've got to answer. But there are many places that don't have some kind of second check like that.
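The "iterations" being discussed are the work factor of a password-based key derivation: each guess an attacker makes costs as many hash rounds as the setting demands. The sketch below uses PBKDF2-HMAC-SHA256 from the Python standard library as an assumed stand-in; the episode does not spell out the exact scheme, and the salt, password, and helper name here are placeholders.

```python
import hashlib


def derive_key(master_password, salt, iterations):
    """Stretch a password into a 32-byte key; cost per guess grows with iterations."""
    return hashlib.pbkdf2_hmac(
        "sha256", master_password.encode("utf-8"), salt, iterations
    )


SALT = b"user@example.com"  # placeholder salt, for illustration only

weak = derive_key("a", SALT, 1)          # one iteration: almost free to brute-force
strong = derive_key("a", SALT, 100_000)  # the same guess now costs 100,000 rounds

# How much more work every single guess takes with the stronger setting.
work_ratio = 100_000 // 1
```

The derived keys have the same size either way; the only difference is how expensive each guess is, which is why a vault left on an old, low iteration count is so much easier to attack.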
29:50 And my bank even has the feature of, even if I tell them to remember my device, they won't. So I have to 2FA every single time.
29:59 I know. It drives me nuts. It drives me nuts. All right. The other thing is, I woke up this morning with a couple thousand dollars. And unfortunately, by the time of the recording, I no longer have that money. But I have a new Mac mini coming, Brian. They just announced the new Mac mini M2 Pro and the new MacBook Pros and Maxes and all that. So we are waiting.
30:24 It says 600 bucks. How are you down 2,000?
30:27 Well, let us go down the path. 600 bucks is the M2 version, which is basically the upgrade of what I have now, which is awesome. But what I want is the M2 Pro for all the video editing, which is like 1,300 to start. But then you're like, you know, I really could use a little more RAM. And with all the video and podcast stuff, I need some more storage. And all of a sudden it's like, oh, I'll just sell my car. I'll get a Mini, that'd be cool. But anyway, I'm very excited about this coming out. I'll let people know what I think when I get it. But I'm sure it'll be lovely.
30:59 Cooper Mini or Mac mini? They're about the same price. Exactly.
32:23 Yeah. Or it could have been used once before, and then somebody refactored the code and forgot to delete the declaration.
32:30 Exactly. Then you live a long time. I will. That's it.
32:33 All right. That was funny. So thanks a lot. And thanks for joining me again today. And thanks, of course, to Microsoft Founders Hub for sponsoring us.
32:41 You bet. And thanks, everyone for listening. Bye. Oh, hi.