Transcript #276: Tracking cyber intruders with Jupyter and Python

Recorded on Tuesday, Mar 22, 2022.

00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.

00:04 This is episode 276, recorded March 22nd, 2022. So many twos. I'm Michael Kennedy.

00:13 And I'm Brian Okken.

00:14 And I'm Ian Hellen.

00:15 Hey, Ian. Welcome to the show. It's great to have you here.

00:18 Thank you very much. I've listened to the show a lot and feel very privileged to appear on it.

00:23 It's our privilege to have you here. Thank you so much for listening. And I know you got some

00:29 cool stuff to share. So we're looking forward to hearing about that. Also, I do want to say

00:33 thank you to FusionAuth for sponsoring the show. I'll tell you more about them later.

00:39 Before we get into the topics, Ian, tell people a quick bit about yourself.

00:42 Sure. I'm a developer at Microsoft, in the Microsoft Threat Intelligence Center.

00:48 Been with Microsoft for quite a long time. Only relatively recently, like four years or so ago,

00:52 got into Python coding with Jupyter Notebooks. So I work on Jupyter Notebooks for the Microsoft

00:58 Sentinel project and own a modest open source package called MSTICPy, which we'll

01:05 cover a little bit later. Takes most of my time.

01:07 Fantastic. The whole cybersecurity threat detection stuff, it's very interesting. There's a lot of

01:14 innovation there, but it's also, it's a challenging area to be working.

01:17 Yep. Yep. We're never short of stuff to do.

01:20 Certainly. I'm sure you're not. Well, Brian, how about you kick us off here?

01:25 Well, so I'm going to start off with a problem. I had a problem and I have a cool solution for it.

01:30 So my problem is on Test & Code. I've got titles for each show, and the show itself is an MP3 file,

01:39 but I want to create automated show notes, or not show notes, a transcript.

01:45 So one of the problems, there's a lot of problems in doing this, trying to automate it, but one of them

01:50 is the title. I want to turn that into something that's a little bit, so something like, you know,

01:57 it's got normal English and capitalization and all sorts of spaces and stuff. I want to turn that into

02:03 things that URLs hate. Yeah. I want to turn that into a URL. And one of the problems,

02:09 one of the things, is getting rid of stop words. So there's a bunch of stuff like lowercasing.

02:13 I can do that easily, but getting rid of stop words was a little hard. So I ran across this

02:19 thing called gensim.parsing.preprocessing. So, preprocessing. Gensim is a larger sort

02:28 of beast. It's used for machine learning and stuff, to generate models. But I'm

02:36 just really using one little piece of it, the preprocessing part. And it's really pretty

02:41 cool. I actually found this article first. There was an article called Removing Stop

02:47 Words from Strings in Python. And it has a discussion of NLTK and Gensim and spaCy. I tried all

02:55 of them out, actually. And the one that really stuck best for me, it talked about using

03:02 remove_stopwords, which is exactly what I wanted, right from Gensim. So I went ahead and tried that

03:09 and it worked really well, but I'm like, wait, I'm pulling this in from the preprocessing library.

03:15 I wonder what else is in there. And there's all sorts of really cool stuff in here. There's

03:22 lower_to_unicode. It turns it both into lowercase and into Unicode. That's pretty neat.

03:27 Don't think I need it, but that's neat. But then there was one where I thought

03:33 maybe this is exactly what I want, something called preprocess_string. And it has a whole bunch

03:38 of filters built into it. Oh, nice. Like strip strip. Yeah. Strip white space, strip punctuation. I love it.

03:45 Yeah. And take away multiple, after it strips punctuation, like you're going to have,

03:49 if I go back, I had a slash in my title for one of the episodes. If it takes that out, I'm going to

03:56 have a space before and a space after. So I want to remove those. So it'll strip multiple white space

04:01 strips out numerics, because I probably don't want numbers in there. And then remove stop words.

04:07 The one thing I don't want, so I'll have to customize how I'm calling this, is stem_text.

04:13 So stem_text, I didn't know what that did without playing with it, but what it does is it

04:18 would take things like twisted and turn it into twist. That's really not right.

04:23 So you definitely don't want that. I don't want that. I don't want it to mess it up, but I think I want

04:26 everything else. So this Gensim library, you know, if you're doing machine learning,

04:32 coming up with models, I think this is a great tool to look into. But actually,

04:39 I'm going to use it just for removing stop words, to create these titles for, you know, my podcast,

04:45 but I think it feels a little weird. It feels like I'm using this really big hammer to do

04:51 this little tiny problem. I guess I'm okay with it, but you know, do you have any other ideas

04:57 where it could use or, well, I didn't know about this. So I wrote my own. Okay. And it's, it's,

05:04 it's kind of janky. Like it's a little bit, a little bit recursive iterative. It's like,

05:08 we'll take away all the punctuation. Now turn all of your white spaces into single white spaces.

05:14 Cause there might've been, you know, dot space. So now you've got two white spaces, but you've got to

05:18 take away, you know, there's like a bunch of weird steps and then, then put it back. This looks

05:22 cleaner. It is a dependency, but it does look cleaner. I like this. I think it's, I'm glad I

05:27 know about it. Ian, what do you think? Is it a huge thing? I mean, as a dependency? I always

05:32 think of it as ML stuff, but this is just the preprocessing, right? Well, I'm actually pulling

05:37 in all of Gensim to get this. I don't know if I can pull in little bits, but it's not

05:42 really part of my application that I'm shipping. It's just a tool that I'm using on my laptop. So I,

05:49 I guess downloading it once doesn't really bother me too much, even if it's a big thing.
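
The title clean-up discussed above can be sketched with only the standard library. This is a minimal illustration, not Brian's script and not Gensim's implementation: the function name `slugify_title` and the tiny `STOP_WORDS` set are assumptions made up for the example, and a real stop-word list would be much longer. It lowercases, replaces punctuation with spaces, drops numbers, collapses whitespace, removes stop words, and deliberately does no stemming.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to", "with"}


def slugify_title(title: str) -> str:
    """Turn an episode title into a URL-friendly slug (no stemming)."""
    text = title.lower()
    # Replace punctuation with spaces so "a/b" does not collapse into "ab".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Drop runs of digits.
    text = re.sub(r"\d+", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    words = [word for word in text.split() if word not in STOP_WORDS]
    return "-".join(words)


print(slugify_title("Tracking cyber intruders with Jupyter and Python"))
# tracking-cyber-intruders-jupyter-python
```

Replacing punctuation with a space first and collapsing whitespace afterwards avoids the double-space problem mentioned for a title containing a slash.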

05:53 But cool. Yeah. I was thinking, yeah, that's a good, that's a good point. If it's running

05:56 local, it's like a dev dependency, who cares? Right. It's like worrying about how big pytest

06:00 is. Like it doesn't really matter. And I'm not, well, I kind of care about that. Cause

06:05 CI is going to pull it in all the time for pytest, but. Yeah, but they got fast networks.

06:10 It's not your bandwidth. It'll be all right. One of the things that struck me about this

06:16 that made me think of your situation is that lowercase to Unicode. So many

06:21 times in the security space, it's about like, you're checking for this representation, but what

06:27 if there's another representation that means the same thing? Like you don't say go to this

06:31 directory. You say go dot, dot. And then over there, you know, those, those kinds of non-canonical

06:35 representations. I wonder if there's any use of this kind of stuff for you.

06:39 Yeah. There's something I kind of touch on in the Pygments section later on, which is that the

06:43 attackers typically write scripted attacks and try to obfuscate code using a mix of kind of

06:49 uppercase and putting random dots. I'm just thinking that'd be a nice, potentially a nice

06:53 way of kind of cleaning some of that, that stuff up.

06:56 Yeah, for sure. There was a, there's been some interesting supply chain vulnerability stuff.

07:01 Remember the guy with colors and, I think, the faker stuff in JavaScript, who

07:06 sabotaged his libraries? There was another one that was maybe well-intentioned. I don't know.

07:14 It was some open source library. I don't believe it was Python. I can't remember what it was.

07:19 It could have been, but I'm pretty sure it was in JavaScript because that's where

07:22 most of the bad stuff was, it seems. Anyway, they taught their dependency

07:28 to erase the hard drive of everybody who installed it who was in Belarus and Russia, which, okay,

07:36 maybe they're trying to contribute, but it ended up doing a bunch of bad things, even to places

07:40 that were trying to help, say, people in the press and journalists do certain things, like,

07:46 you know, connect with sources, and it erased that database as well. And what they did to make it so

07:51 that nobody would notice in the GitHub commit before it went out to npm was Base64-encode their

07:58 changes. So they basically put in a Base64-encoded string and then, like, decode and then run that.

08:03 And, you know, it's like that kind of stuff. I know this won't solve that problem, but yeah,

08:07 you know, that, that sort of category of like weird representations.

08:10 Yeah. You need MSTICPy for something like that. It's one of the things we, yeah, it's a common thing,

08:15 kind of Base64 decoding before deobfuscating. But yeah.
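
As a rough illustration of the "decode before you analyze" step, the sketch below scans source text for long Base64-looking runs and tries to decode them. It is a hedged, standard-library-only example and does not use or represent MSTICPy's API; the function name `find_base64_payloads`, the 16-character threshold, and the sample string are all assumptions chosen for the demo.

```python
import base64
import binascii
import re
from typing import List

# Runs of 16+ Base64 characters with optional padding.
# The length threshold is an arbitrary assumption for this sketch.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def find_base64_payloads(source: str) -> List[str]:
    """Return readable text hidden in Base64-looking runs of `source`."""
    decoded = []
    for match in B64_CANDIDATE.finditer(source):
        candidate = match.group(0)
        try:
            raw = base64.b64decode(candidate, validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded
        if text.isprintable():
            decoded.append(text)
    return decoded


# Build a sample "obfuscated" line rather than hard-coding an encoded string.
payload = base64.b64encode(b"format the drive now").decode("ascii")
sample = f'run(decode("{payload}"))'
print(find_base64_payloads(sample))
```

A check like this only surfaces candidates for a human or a later analysis step to look at; ordinary long identifiers can also happen to decode cleanly.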

08:19 Yeah. Interesting. Yeah, I thought of maybe using something like that,

08:24 because one of the problems we have is like every script is kind of slightly different.

08:28 If you could use something like that, essentially kind of apply like sentiment analysis to

08:34 a script. I mean, this is a big problem. It's just not something I've particularly solved.

08:38 But that might be a kind of useful thing, just picking out certain things that indicate

08:43 malicious, like format, you know, format drive.

08:47 Exactly. Yeah. You could certainly represent like this one does hard drive stuff. Is this,

08:52 I thought it was parsing colors. Why is it doing things with the hard drive? This is odd,

08:55 you know, like, or with the network, stuff like that. Cool. All right. Well, you know what you

09:00 would really want to check out if you were trying to research these things, probably documentation.

09:04 So I want to tell you all about DevDocs, devdocs.io. This is pretty cool. Now, when you get there,

09:09 it's interesting: on my Firefox, it's just got like the mobile view, which is really odd. If you go

09:15 there with a full browser, or what it believes is the full browser, I guess it's like a slightly

09:20 different view. That's pretty similar, but not the same. So if you open it up, there's a whole bunch of

09:26 programming technologies, let's say not just Python or JavaScript or something, but there's also Vue.js.

09:33 There's Werkzeug, for example, like some of the foundation of Flask, and you can pick the particular

09:37 versions and stuff. So you can go in and like enable these different things. So maybe I care about

09:42 Vue. I can go over here and enable that one. We definitely want some Python. Let me go find

09:47 some Python and it gives you all the versions. I'll take that. And let's say I'm also working with

09:52 Postgres. So I'll enable that documentation. And then I might be working with Nginx for the front

09:57 end, which is somewhere right here. So you can go enable that. And then it will be up near the top

10:05 somewhere here. You can see these are either the default ones or the ones that I checked on. So then

10:09 you can open them up and say, I want to go and see the Nginx guide about a debugging log. And then it

10:15 takes you to the documentation for that technology. So it's like a meta documentation repository for all

10:22 of these things all at once, which is pretty cool. Right? So I can go up here and search. I want to

10:26 know about, let's say, media tags or something. So you can see the stuff in HTML5. And

10:33 when you say media, it also matches median, so you can see that in the statistics

10:38 module for Python, some stuff for CSS, or you could come over and say, look, I just want to search for

10:44 CSS. And then you get like using media queries and how to do that kind of stuff. So it's kind of a,

10:50 what you do is you turn on the pieces that are relevant to you, and then you can search across

10:54 those technologies. Cool, right? Wow. Yeah. And, and then if you're on the move, you can come over

11:01 here and turn on offline data, and it'll download all of that as an app, so that when you're at

11:09 the coffee shop or you're on a plane, you now have all the documentation for Python 3.10, Vue.js,

11:13 Werkzeug, Nginx, et cetera, et cetera, that you can use, which is pretty cool. And this is something

11:19 that drives me crazy about Firefox. They had it and they took it away. And I don't understand why,

11:24 because I feel like Firefox is all about the web. So they took away the ability to do progressive

11:29 web apps in Firefox, but all the Chromium browsers support it. So you can actually go and install this

11:35 as a dedicated application on your system. So if you have no web, you just click that open. It's

11:41 its own window. You can, you know, Alt-Tab, Command-Tab between it. Super easy. And then turn on the

11:47 offline mode. And you basically have an app that has offline documentation for all the programming

11:52 technologies that you care about. So this is my new coffee shop buddy.

11:56 Does the search go across the things you've selected, then? So if I search for like replace or something,

12:01 it's the things I've selected.

12:03 Yeah. So if you turn on like JavaScript and Python, it would look for that in both languages.

12:08 Oh, okay.

12:08 Yeah. So basically the ones you turn on, there's a ton of them, right? And you pick,

12:12 you say, these are interesting to me and then search and stuff from what I can tell only applies

12:16 to the technologies you say you care about. Cause like if, if you don't use Java, you really don't

12:21 want to see the documentation for Java search, right? That would be useless.

12:23 Yeah. One of the things I like about this is it also has versions. So, if you're using a,

12:28 like an older version of Postgres, you can just enable that version.

12:33 Right. Sometimes it doesn't matter very much, but other times it matters massively, like Bootstrap

12:38 3 and Bootstrap 5, they're like fully incompatible, basically. Like they're totally

12:42 different keywords and grid systems. And you don't want just the latest. If you've got an old app

12:47 you're working on something like that. Python's more forgiving about that kind of stuff, right?

12:51 It doesn't break as often.

12:52 I was amused that the list, though, has like 3.9, 3.8 for Python

12:59 and it has 3.10 at the bottom, because one is obviously...

13:03 Cause it's alphabetically sorted. How interesting. Ian, what do you think of this?

13:07 That's very cool. I'm amazed. Is somebody at DevDocs kind of manually maintaining all of the links

13:13 to these, like the original source documentation?

13:17 Yeah. Where are they getting it from? Right. I mean, cause there's, they're super disparate.

13:21 It's like matplotlib and markdown and MariaDB. These are all, it's unlikely they're all stored

13:26 in the same basic system. Right. I don't know how they get them actually.

13:29 Yeah. That's very cool. I mean, I know, I normally have solved the same problem by having like 130

13:34 tabs open to different bits of Python docs and pandas and.

13:38 Exactly. Exactly. Yeah. I'm pretty sure they got pandas in here.

13:42 They've got NumPy as its own thing, and we saw matplotlib. There's pandas and there's even,

13:47 you know, versions of pandas across there.

13:50 Single tab solution. Brilliant.

13:53 Yeah. It looks, looks pretty good to me. All right. You want to tell us about what you got

13:57 for your first item?

13:58 Okay. Sure. Yeah. So, as I mentioned earlier, I own a package called MSTICPy.

14:06 And the first thing to sort out with it is the spelling, because I suffer from this on a daily basis,

14:10 mistyping it, even though I've owned it for like three or four years. So it's MSTIC, standing for

14:17 Microsoft Threat Intelligence Center. There's no Y or anything like that in there. So it's a tool

14:21 set for cybersecurity investigations and hunting in Python, mainly in Jupyter notebooks. So there are a

14:29 couple of questions to ask about that. So firstly, what is cybersecurity hunting and investigation, and

14:35 why are Jupyter notebooks useful? So the first one: cybersecurity investigation is really responding

14:41 to alerts or other kinds of threat intelligence and trawling through typically large amounts of

14:47 security logs from cloud services, hosts, account services to determine whether this is a real threat

14:53 or not. And there are two main kinds of... That's one of the huge problems, right? Is you've got all

14:57 these different systems. How are you going to know if someone, if you don't have a tool like this,

15:02 how are you going to know that something, someone's in there rooting around, right?

15:06 Yeah. Yeah. And there are a couple of things that usually trigger this kind of search. So one of them

15:11 is an alert, maybe coming from your SIEM, and that stands for security information and

15:17 event management. So that's like a console; ArcSight is a traditional one, or Microsoft Sentinel

15:24 is a cloud-based one. So you get an alert based on a rule, and in a fairly managed process,

15:30 somebody needs to go and investigate. Is this a real threat or is this just noise? Or there might be something

15:35 like SolarWinds a year ago, or Log4j, something in the press or something

15:42 from a threat intel kind of alert that says this kind of threat is around, and that's a more ad hoc process, kind of hunting.

15:49 Like, do we see this in our organization? So that's kind of what MSTICPy is trying to, you know,

15:55 address the needs of. And the second question is why Jupyter notebooks? Why would you do this in a Jupyter

16:00 notebook rather than in your existing SOC tools? I think this kind of

16:08 activity has a lot in common with big science data, sorry, big data science. I mean, something like

16:15 astronomy where you're kind of, you know, hunting for an adversary activity is a little bit like trying to find an exoplanet

16:21 in kind of gigabytes of data or a new quasar or something like that.

16:25 A hundred thousand stars or a hundred thousand lines of log file, and you're hunting for some patterns and stuff.

16:31 Right. And you've got a few photons you're trying to determine are these kind of different, you know,

16:35 something like an adversary activity is a little bit like that. It's like millions and millions of events

16:39 and you're trying to find the bad stuff. So traditional SOC tools, you know, can be really excellent.

16:45 And I work with one that I think is really good, but they all have limitations.

16:50 What's a SOC tool? SOC: security operations center. So, something like, you

16:56 know, a console that fires alerts and tells you that they have a bunch of analysts, engineers looking

17:03 at this output of this and deciding, and that's the trigger for their investigations. They're like,

17:08 is it like a failed login to the SQL server?

17:10 Yeah. Something like that. Or, you know, it could be a more sophisticated thing. Like,

17:15 something's, you know, tried to access the kind of password data on this, or looks like it's trying

17:20 to access the password data on this host or, or has made a weird kind of configuration change to, mailbox settings.

17:28 So all those kinds of things can kind of trigger alerts and investigations. but you are limited

17:34 in most kind of operation center environments. Notebooks allow you to kind of break out of some

17:39 of the constraints of that. So firstly, you can get data from anywhere. You're not just limited by

17:45 kind of what's in your logs. You could go to VirusTotal, so you can bring data from anywhere.

17:50 You can use customized kind of analysis, so write your own or get things from PyPI. Lots

17:57 of people have kind of written this stuff. You control the workflow. So you don't have to follow

18:02 what the tool says. You can reorder things, you can backtrack, redo things, and the workflow is repeatable.

18:08 So if you get a similar kind of, you know, issue again, or similar kind of alert, you can

18:14 fish out an old notebook and rerun the same kind of analysis. And you end up with a nice kind of

18:19 shareable document that, it describes your investigation a bit like the results of a

18:25 scientific investigation. It's like, here are all the steps I took and these are the results.

18:29 And this is what they, this is what we determined to be the bad, you know, the bad activity.

18:33 Right. The other thing that seems useful here is Jupyter. Often the notebooks will save the last

18:40 bit of computed information. And then you can go, you know, change a cell, ask the question again,

18:45 change without rerunning the whole thing. And like that's parsing tons of logs or

18:49 pulling them over SSH or whatever that not doing that again is nice.

18:53 Yeah. And it's brilliant. If you don't like doing lots of queries in different browser tabs

18:57 and your browser crashes, they've all gone. What do you do?

19:01 It's all in a Jupyter notebook. I say, it's like second by second, after you do it,

19:06 you can just go back and you can go back to things like you may have done months ago.

19:09 So, yeah, absolutely.

19:11 Yeah. So, so when I started all of this, I kind of thought a lot of this stuff for cyber

19:16 investigations would be available on PyPI. I thought, great, Jupyter notebooks seem brilliant.

19:21 And there's going to be a process tree viewer and there's going to be an event timeline and all this

19:25 kind of stuff. And I found out there wasn't, at least I couldn't find it. So I decided to

19:32 just like stop everything. Need to start writing this stuff. So it turns out that things like

19:37 visualizations you need for detecting exoplanets are a bit different from ones you need to detect

19:42 bad actors. So we started building this thing, originally me, but there's now Pete

19:48 Bryan and Ashwin Patil, two of my colleagues, also kind of working on it, and a bunch of people in the

19:55 community. It's got four main functional sections. It's like data querying, how you get data in,

20:01 how you do templated queries. There's enrichment. So for example, if you have something like an IP address,

20:07 you might have a bunch of questions about it as an analyst, like which geographical location is this IP

20:13 address from, or are there any malware reports about it. The third area is analysis, things like

20:19 anomaly identification, like the thing you were talking about, a spike in failed logon events,

20:25 an unusual spike in failed logon events, that kind of thing. The final area is visualizations,

20:29 and these are more specialized. I've got a couple of examples in the show notes.

20:34 This is like an anomaly identification pattern. This is one of the custom ones. We use Bokeh,

20:40 which is a really nice kind of visualization package, to allow you to kind

20:45 of view data in a way that an analyst kind of expects to see it. So there's more of this

20:51 kind of visualization than more traditional kind of graphs. I would much rather look at this than

20:56 log files or event logs or, or whatever, you know? Yeah. That's the whole thing about, you know, you,

21:00 you, you need, you may have thousands of events and you need to get down to the few that are the

21:04 interesting thing. So one of the areas that we're trying to focus on

21:10 currently, because we wrote all this stuff and you have like hundreds of functions that you could use,

21:15 but it's kind of difficult to discover them. And because they evolved a little bit organically,

21:21 they each work in a slightly different way, with a different set of parameters.

21:26 So the work we're currently doing is trying to make this all a bit more accessible. So all of the

21:31 functions that relate to say an IP address, all the questions you want to ask about it are kind

21:36 of dynamically attached to a class called IP address. So they're all like things like,

21:41 Oh, interesting. So you don't have to work just with a raw string or just some raw IP representation, but you can ask it questions like its location.

21:49 Well, it's not quite that intelligent. It's even a bit less intelligent than Alexa, but, but it's,

21:55 but it's more like, you know, there might be things like geolocation of an IP address,

22:01 threat intel lookups, different queries that might have an IP address as a parameter.

22:07 And previously you'd have to go and find all of these things and import them separately and run

22:12 them. But now anything that uses an IP address as a parameter is dynamically attached as a method,

22:17 which means that you just have one object to import, and then you can do all of these different

22:22 operations on this single item. There are some things that don't work with that.

22:27 Some things like the visualizations, for example, they're not IP address or host or account specific.

22:33 They work on big blocks of data. So the other area we're working on is anything that takes a

22:39 bunch of data as an input. We're writing those as pandas accessors, so they appear as methods on a

22:46 DataFrame. So you do kind of DataFrame.mp_plot.timeline, right? And it would produce your

22:53 timeline, as long as it's the right kind of data. So yeah, that's one of the challenges of

22:57 writing this kind of thing organically: you end up with a lot of stuff, but nobody knows it's there

23:02 and nobody knows how to import it. So we try to make it as accessible as possible, so that it just becomes a very

23:07 intuitive thing. Oh, I have an IP address. What functions can I do? I can do this, you know,

23:12 it's all like tab completable, that kind of thing.
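
The idea of attaching every IP-related function to one entity class can be sketched in a few lines. This is an illustration of the general technique only and is not MSTICPy's code: the `IpAddress` class, the `attach` helper, and the two example functions are hypothetical names invented for this sketch, and the checks use the standard library `ipaddress` module rather than any threat intelligence service.

```python
import ipaddress


class IpAddress:
    """Hypothetical entity wrapping a single IP address string."""

    def __init__(self, address: str) -> None:
        self.address = address


# Stand-alone functions that each take the entity as their parameter.
def is_private(entity) -> bool:
    return ipaddress.ip_address(entity.address).is_private


def ip_version(entity) -> int:
    return ipaddress.ip_address(entity.address).version


def attach(entity_cls, *funcs) -> None:
    """Attach each function to the class so it shows up as a method."""
    for func in funcs:
        setattr(entity_cls, func.__name__, func)


attach(IpAddress, is_private, ip_version)

ip = IpAddress("192.168.1.10")
print(ip.is_private(), ip.ip_version())  # True 4
```

Because the functions become attributes of the class, they appear in `dir(ip)` and in tab completion, which is the discoverability benefit described above: one import, then explore the object.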

23:14 Yeah, I think it's really cool. You've taken this Python data stack view of cyber security and threat

23:21 detection. Yeah.

23:22 Yeah. Brian, what do you think?

23:23 Well, it's definitely a complicated area. And one of the things I like about this

23:29 story is just talking about the complexities in API design and discoverability. That

23:36 applies to lots of different fields, but yeah.

23:39 Yeah. It's one of those things you should have thought about at the beginning, but

23:42 even at the end, you can be tidying things up. Yeah. So... Famous last words.

23:49 So yeah, we're definitely open for like other people collaborating, contributing stuff,

23:55 cause there's a lot of ground to cover.

23:56 Yeah, for sure. It's on GitHub, I saw. One final question before we move on. Is it just for Azure

24:04 or is, is this a thing that more broadly works across different systems?

24:08 No, I think I should have mentioned that a little bit earlier on. We originally built

24:11 it for Microsoft Sentinel notebooks, but it supports like Splunk, Defender,

24:16 and we're working on an Elastic provider. So really anything you can get into a pandas DataFrame,

24:21 you can use most of the functionality. So even if we don't, we don't have a provider ourselves,

24:26 if you've got something like PySpark and you can get a data frame, then all of our functions take

24:32 data frame. You know, we use pandas as our universal data interchange format.

24:37 Yeah, indeed. Indeed. Kim van Wyk out in the audience likes it: it's a much nicer way

24:44 to glean info from logs than complex grep. I'm right there with you. All right. Now, before we

24:49 move on, Brian, let me tell you about our sponsor for this episode. This episode of Python Bytes is

24:55 brought to you by FusionAuth. FusionAuth is an authentication and authorization platform built by

25:01 devs for devs. It solves the problem of building essential user security without adding risk or

25:07 distracting from the primary application. FusionAuth has all the features you need with great support and

25:12 a price that won't break the bank. And you can either self-host it or get the fully managed solution

25:18 hosted in any AWS region. Do you have a side project that needs custom login and registration,

25:23 multi-factor authentication, social logins, or user management? Download FusionAuth Community

25:29 Edition for free. The best part is you get unlimited users and there's no credit card or subscription

25:35 required. Learn more and get started at pythonbytes.fm/fusionauth. The link's in your show notes.

25:41 Thank you to FusionAuth for supporting the show. All right. What do you got for your next one,

25:45 Brian? Numbers, something every computer scientist should know?

25:49 Yes. Floating point arithmetic is complicated. And so when I started working

25:56 professionally, one of the things I was recommended to read was an article called What Every

26:01 Computer Scientist Should Know About Floating-Point Arithmetic. And don't worry, it's only like a

26:06 really long paper with lots of math. So I am not telling you to read this, although it is an

26:13 interesting read. What I would like you to read is this article by David Amos called The Right Way to

26:19 Compare Floats in Python, because there's a few things that we need to know about floats when we're

26:24 using them. And he covers all of this in the article without going through

26:30 tons of scary math. Floating point numbers have to be represented in a way that the computer

26:37 can store them and use them and manipulate them, even though some numbers are huge and won't fit

26:43 normally. So we have to do things like accept that there's error and rounding. So there's a little bit

26:49 of a discussion there that he talks about. One of the things that surprises people sometimes when they

26:54 first come into Python, but it's not just Python, it's most languages, is that somewhere

27:00 there's going to be something obvious that doesn't work. Like in David's example, 0.1

27:06 plus 0.2 equals, comparison equals, 0.3. And that will show up as False, because they're not equal.

27:14 And this is weird. It's obviously crazy that that doesn't work, but it's not just equals.

27:21 You can also do comparisons like, you know, less than or greater than. So not only

27:27 are they not equal, 0.1 plus 0.2 is not even less than or equal to 0.3. It's weird.
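
The surprise described above is easy to reproduce in a Python session. This short sketch uses only built-in floats; the variable name `total` is just for the example.

```python
total = 0.1 + 0.2

print(total == 0.3)  # False
print(total <= 0.3)  # False
print(total > 0.3)   # True
print(repr(total))   # 0.30000000000000004
```

Neither 0.1 nor 0.2 can be stored exactly in binary floating point, and the rounded sum lands just above the float closest to 0.3, which is why both the equality and the less-than-or-equal checks fail.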

27:34 So what do you do? The gist of it is, don't compare things with normal

27:42 math comparisons if there are floating-point values involved. So what you want to do instead, and

27:49 here's a little tiny bit of math, way less than the example. The thesis,

27:54 the dissertation. Yeah. So there's a whole bunch of stuff built into Python that you can use

27:59 to work with comparisons. And one of the most common ones, I'm trying to get there,

28:05 is math.isclose. So there's the math library, and it has an isclose function

28:11 that's used to just say, hey, I've got two values. Are these close enough? And

28:18 if you have to compare floats, something like this is great. And

28:25 behind the scenes, what it does is it's taking the two values and

28:29 subtracting them and figuring out if the delta, or the absolute value of the delta, is below some

28:36 tolerance, some reasonable tolerance, like close enough. And what that tolerance is,

28:41 is either a relative or absolute tolerance. And most of the time you can kind of get away

28:47 with not caring about that, but if you do care about it, you can control that. You can pass in

28:52 what tolerance you expect things to be within. I use stuff like this all the time

28:57 with test equipment, because I definitely want control over the tolerance levels.
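As a rough illustration of the tolerance check described here, the sketch below calls `math.isclose` with its `rel_tol` and `abs_tol` parameters and shows a simplified version of the comparison the Python documentation gives for it. The helper `is_close_sketch` and the voltage readings are hypothetical, made up for this example; the real function also handles infinities and NaN.

```python
import math


def is_close_sketch(a, b, rel_tol=1e-09, abs_tol=0.0):
    """Simplified form of the documented math.isclose check (finite values only)."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


# With the default tolerances, the classic example compares as close.
default_check = math.isclose(0.1 + 0.2, 0.3)

# Hypothetical test-equipment reading: 4.98 V measured against a 5.0 V nominal.
measured, nominal = 4.98, 5.0
within_one_percent = math.isclose(measured, nominal, rel_tol=0.01)
within_tenth_percent = math.isclose(measured, nominal, rel_tol=0.001)

# Near zero a relative tolerance alone is not enough; pass abs_tol as well.
near_zero_default = math.isclose(1e-10, 0.0)
near_zero_abs = math.isclose(1e-10, 0.0, abs_tol=1e-09)

print(default_check, within_one_percent, within_tenth_percent)
print(near_zero_default, near_zero_abs)
```

The relative tolerance scales with the size of the values, while the absolute tolerance is the one that matters when comparing against zero.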

29:02 So, yeah, for sure. So there's math.isclose, but then there's also, I'm not going to

29:08 scroll all the way down here, but he also covers NumPy. So NumPy has got a

29:13 couple of these that are really great. One of them is isclose also, but it works on arrays and

29:19 it'll give you an array of True and False values. But you can also use allclose,

29:25 which just says you've got two arrays, and if all of the pairs are close enough,

29:30 it'll say they match. Also covered, which we use during testing a lot, is pytest.approx,

29:37 which is a little bit of a different beast, but David covers that. So basically this

29:43 is a semi-regular reminder to anybody using floating-point math in Python that you should be careful

29:49 with it, or in any other language. Yeah. It's not a Python thing. It's just about representing

29:55 things that don't fit. Now there's some things sometimes where you have to be very exact.

29:59 You need to be very precise. And in those cases, Python does have the Decimal and Fraction types.

30:05 And David covers these in the article, which are cool. They're cool things to know about,

30:10 like definitely for people working with money or other very high-precision values.

30:17 So those are covered. You do take some sort of a hit for those. But if you really care about

30:23 the precision and want to do things exactly right, then you probably should read that

30:29 larger article, because there's things that you have to do, like certain operations before

30:34 other operations, to try to keep the error from accumulating too much. So it gets messy.
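A short sketch of the exact types mentioned here, using the standard library's `decimal` and `fractions` modules. The price is a made-up value, and `math.fsum` is included only as one standard-library example of limiting accumulated error; it is not something the discussion names.

```python
import math
from decimal import Decimal
from fractions import Fraction

# Decimal and Fraction represent these values exactly, so equality holds.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
fraction_equal = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

# Build Decimals from strings: Decimal(0.1) copies the float's binary error.
from_float_differs = Decimal(0.1) != Decimal("0.1")

# Hypothetical money calculation: three items at 19.99 each.
line_total = Decimal("19.99") * 3

# Accumulated error: plain sum() drifts, math.fsum() tracks the lost bits.
naive_sum = sum([0.1] * 10)
careful_sum = math.fsum([0.1] * 10)

print(decimal_equal, fraction_equal, from_float_differs)
print(line_total, naive_sum, careful_sum)
```

These types trade speed for exactness, which is presumably the "hit" referred to above.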

30:39 Interesting.

30:40 I think I'm fundamentally disturbed by the idea that zero isn't zero. So my approach to floating

30:45 point numbers is normally convert them to ints. Yeah. I was thinking that, yeah, sometimes that is

30:53 the way to do it. Right. I was thinking this kind of stuff maybe applies a lot to the project that

30:59 you're working on. If you're trying to come up with ratios that represent, you know, how risky something

31:05 is and things like that. Yeah. Yeah. Yeah. I mean, certainly a lot of, yeah, I was being a bit

31:10 flippant before. It's just fun. I'm very Platonic at heart, I think. So like zero or

31:17 one should be zero or one, not nearly one or nearly zero. There should be a perfect square and a perfect

31:22 circle. Like how can they not exist in our language? Is it really zero or negative zero?

31:27 Henry in the audience. Henry also points out that pytest.approx also works on NumPy arrays as

31:37 well. Nice.

31:38 Which is pretty cool.
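To illustrate the NumPy helpers and Henry's point about `pytest.approx`, here is a small sketch. It assumes both `numpy` and `pytest` are installed; the array values are picked for illustration only.

```python
import numpy as np
import pytest

sums = np.array([0.1, 0.2]) + np.array([0.2, 0.4])
expected = np.array([0.3, 0.6])

# Exact element-wise comparison is affected by the rounding error.
exact = sums == expected

# np.isclose returns one boolean per pair; np.allclose reduces them to one.
elementwise = np.isclose(sums, expected)
all_match = np.allclose(sums, expected)

# pytest.approx works for plain floats and for NumPy arrays.
approx_scalar = (0.1 + 0.2) == pytest.approx(0.3)
approx_array = sums == pytest.approx(expected)

print(exact, elementwise, all_match, approx_scalar, approx_array)
```

In a test, the last two comparisons would normally appear directly inside an `assert` statement.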

31:39 Cool.

31:39 Cool.

31:39 You can put that all together. All right. Let me tell you all about pypyr. I think that's,

31:45 that might be the representation, the way you pronounce it. Everything needs its own description,

31:49 its own like little phonetic bit. So this is a simple way to create scripts that run and do

31:57 stuff on your computer using Python. And what's cool about it is it has a real simple way to define

32:02 the steps. Some of those steps can be optional, but then you can also piece together things like

32:07 other programming. So you can combine commands, different scripts in different languages and

32:13 applications all into one sequence of events that happens on your computer. So it's basically a task

32:20 runner where you define stuff in YAML. And probably the best way to see is to go check out the docs. And

32:25 there's a whole bunch of docs. The docs are really nice here actually. So for example,

32:30 if you go to getting started and come down here and run your first pipeline, I really like the way

32:35 the docs here look. But the way you define it, here's like a one-step one: you

32:41 just say the steps, and it's all YAML, and give a step a name so you can refer to it. And then you have inputs

32:47 and outputs, and you do the little curly-brace string interpolation types of things. Or you can

32:51 have more complex ones, like with different steps, and you can even have little comments. There's a way to

32:57 put a comment in your YAML file as well. So there's also conditional. Let's see if I can find a good

33:03 conditional one down here. Here's one that works with, like, this one is just an echo

33:09 statement and the ping command, but you know, whatever, whatever you want to do, you can basically

33:14 pass command line arguments to the YAML file or to the workflow, the pipeline, and it'll take those and

33:21 feed them into the steps. So for example, when you call it, you can say like count equals one and IP

33:27 equals that. And those will become the little string-interpolated pieces that go in there. So you can

33:32 just combine whatever, basically whatever commands are available to the shell, right? Be that Python or

33:38 POSIX or Windows or PowerShell or whatever you're looking to do. Pretty cool, huh?

33:42 Hmm. That's pretty neat. I might need this for my job of automating my show notes.

33:49 I might use some of this.

33:50 Oh yeah, there you go. If you can find this, go do that. And so on. Like, here's one

33:55 that sort of uses the truthiness. So it says there's a bunch of different steps, and

33:59 you can use the run flag. So here it says run if there's a value for A on this one. And this one

34:06 says run if there's a value for B. And then there's an example where it says, okay, we run it by itself.

34:10 Those don't run. But if you pass A, then it runs the A step. If you pass B, it does the B step,

34:15 or it can do both if you pass them both. And I like the simplicity of it. A lot of tools

34:20 like this feel like they're pretty complicated.

34:22 You know, it's sort of like your example with Gensim, Brian, where you're like,

34:26 is this thing too heavy weight for what I'm trying to ask it to do? You know? And this seems like a

34:31 real simple thing. And I don't have to learn about make or any of those kinds of things.

34:34 Yeah. GitHub Actions or, yeah. Yeah. Yeah. It's got a bit of a GitHub Actions feel to it.

34:40 That's, but it seems like a nicer kind of declarative. That's really cool.

34:45 Indeed. Yeah. If you were not into programming or you didn't want your steps to be programming,

34:49 but of course what happens at each step, you could call a Python app or script that's going to do

34:55 something complicated, right, if it needs to. But the orchestration of that,

34:59 you don't have to make complicated. Is it just a command line tool? Or can you invoke it from Python?

35:03 Might be a bit interesting. I'm sure there's a way to import it and make it do a thing. You

35:09 know, it's probably just a Python package with an entry point in this package. So I would think so.

35:14 Yeah. Cause it would be nice to be able to do that rather than just using subprocess to invoke a lot

35:18 of things. Like if you're in. Oh, interesting. I hadn't really thought about it as a replacement for

35:23 subprocess, but yeah, because a lot of times when you're trying to orchestrate stuff, like it talks

35:28 about here being part of the shell or being another app or another language, you would just use subprocess

35:34 on it. Right. Yeah. Cool. Well, there it is. pypyr, pypyr.io, and people can check that out. It looks,

35:40 looks pretty interesting. Nice. All right. Ian, you want to take us out with your final item here?

35:44 Ah, Pygments. Okay. So this is a package. I mean, if you're a developer, there's a very good chance

35:49 that you have been using this for years without, like me, without knowing about it. You might have

35:54 seen it being installed as like a dependency. It's like, what is that thing? That was my thought,

35:58 Ian. I'm like, I know I see this all the time in my dependencies and I just never really bothered to

36:03 look into what it does. Yeah. So I hadn't until recently. So if you use Jupyter Notebook

36:09 markdown, you know, you can put, like, three backticks and then a block of code. And you can actually

36:16 put, like, Python or bash or something as a, and it will intelligently highlight it. So the thing that's

36:23 doing that intelligent highlighting is Pygments. GitHub markdown, same kind of thing, although I'm not

36:28 sure whether GitHub uses Pygments. And if you do developer docs, like Read the Docs and Sphinx,

36:34 that also uses Pygments to kind of color code your code samples. And I know there's a lot of,

36:41 you know, writing kind of blog posts and stuff like that. There are

36:45 quite a few services out there where you can take a chunk of code and it will intelligently

36:50 highlight it and give you a JPEG or a PNG back. And that's kind of nice, but then you can't copy

36:56 and paste the code from those samples. So I don't like that really. I think if you're going to put

37:00 code in an article, you probably intend for people to be able to copy and paste it.

37:05 Yeah. That's the most likely thing you are to copy and paste.

37:08 Yeah.

37:08 Yeah. Right. Cause you want that code over here.

37:10 Yeah. You don't want an image of your, I mean, cause you could use OCR to like reinterpret it,

37:13 but it's all, yeah. And then maybe Brian's Gensim to, like, tidy it up.

37:19 But, so with Pygments, you can use it as a standalone package and it can do this kind of

37:27 rendering, and it can render to, like, HTML with CSS style sheets for all of the coding. It also

37:33 renders to, like, ANSI terminal, LaTeX, a few other kinds of things. So if you're using,

37:40 you know, if you want to get a nicely formatted piece of code in, in a document or you're doing

37:45 developer docs, it's certainly kind of useful. I mean, I came across it, or should I just say

37:50 one thing: it also supports, maybe I can just switch, it supports lots and lots of languages. So it's

37:55 very simple to use. It has a highlight function, and then you import a lexer, which is like the

38:02 thing that understands the tokens in a language, and a formatter for the output type you want.

38:08 And I think there's hundreds of these things. So there are a lot of languages in there.
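The three pieces described here (a `highlight` function, a lexer, and a formatter) can be sketched roughly as follows. This assumes the `pygments` package is installed; the sample snippets are made up, and `nobackground=True` is an `HtmlFormatter` option that keeps the generated CSS from setting a background color on the wrapping element.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer, get_lexer_by_name

# Made-up snippet to highlight.
code = 'def greet(name):\n    return f"Hello, {name}"\n'

# The lexer tokenizes the source; the formatter decides the output type.
formatter = HtmlFormatter(nobackground=True)
html = highlight(code, PythonLexer(), formatter)

# The HTML uses CSS classes, so the matching style rules are generated separately.
css = formatter.get_style_defs(".highlight")

# Lexers can also be looked up by a short language name.
bash_lexer = get_lexer_by_name("bash")
bash_html = highlight('echo "hi"\n', bash_lexer, formatter)

print(html)
```

Because the output is ordinary HTML text rather than an image, readers can still copy and paste the highlighted code.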

38:12 No kidding.

38:13 More than half of these I've never heard of. And as well as things like,

38:17 you know, you'd expect Python, it supports Python tracebacks. So it has a separate lexer for color

38:22 coding tracebacks. All the usual languages you'd expect, but also some things like data formats,

38:28 like TOML, JSON, XML. Okay. Interesting. Like a lot of the files that we might run across.

38:36 Yeah.

38:37 Yeah.

38:37 Yeah. And so it's very, very easy to use. And the reason I came across it is because

38:44 recently, so a lot of attacker code tends to be deliberately obfuscated. So it's kind of Base64

38:51 encoded, but then even once you decode it, it's kind of munged in a way to make it as unreadable as

38:57 possible. So one of the things that we try to do is pull that code back, like decode it, try to

39:03 clean it, deobfuscate it. But if you can present it as close to the way a

39:09 developer would write it as possible, it makes it much quicker for an analyst to determine

39:14 what is this doing? So we've used it now in MSTICPy to kind of color display things like,

39:21 well, it's just PowerShell script or bash or something like that. So that's how I came across it.

39:26 Actually, rather than just seeing it go past as part of a pip install, I actually had to invoke it

39:32 directly. So a kind of big shout out to the developers and maintainers of Pygments.

39:38 It's one of those packages that probably millions of people benefit from, but very few people

39:43 kind of know about it, and it's just super easy to use. They seem to be adding

39:48 lexers all the time. So, great. Yeah, this is amazing. I didn't realize that it did all

39:54 of this. This is way more advanced than I thought. Brian, did you know? No, I just thought it was

40:00 something that magically did syntax highlighting, so I didn't have to care about it.

40:04 Yeah, exactly. I've got a little example in the show notes as well I posted. It has a dark theme.

40:13 Yeah. Yeah. And you probably want to include this nobackground=True

40:18 if you're using Jupyter Notebooks. Cause if you select a theme, it just flips the whole notebook's kind of

40:23 CSS theme. So that tells it just not to mess with what's in the background. Okay. Yeah,

40:29 that looks great. Yeah. Thanks. Thanks for pointing out how useful that can be. That's, that's cool.

40:34 Like I said, I've seen it go by all the time. I just never really paid that much attention to it.

40:39 It's probably a pretty minority use, but like if you need it, it's great.

40:42 Yeah. It's incredibly powerful. Fantastic. Well, that's all of our main items. Brian,

40:46 you got any extras? Just one extra, actually. One of the things when I was doing the

40:51 first topic with Gensim: it doesn't have very many dependencies,

40:57 but one of the dependencies is this library called smart_open. And I'm like, what? I

41:03 open things and I want to be smart about it. So I wanted to check this out and it's pretty neat.

41:09 I don't know if we've covered this before, but it basically mimics the interface of

41:15 normal Python open, but you can pass it really anything. And it does, like,

41:22 transparent on-the-fly reading of things, efficient streaming of large files from like S3 or Azure

41:29 or over the web.

41:31 Even straight just HTTP. Yeah. If you just have a link to a large file on a web server.

41:35 Yeah. And then just the code for it is just super nice. You know, you import open

41:41 from smart_open and you've got, like, "for line in open" this thing, and you can just work with each

41:49 line there. It's pretty cool.

41:51 I love it. That's a, that's a great one. Very nice. Ian, you got any extras you want to

41:56 shout out while we're here? I don't, I'm afraid.

41:59 I have two real quick ones to just quickly talk about. Last time,

42:07 Emily Morehouse spoke about using autosquash, which was really cool. So Adam,

42:14 let me get the attribution correct here. Adam Parkin sent in a follow-up to say,

42:20 hey, you should check out this article over here called "Fixing commits with

42:25 git commit --fixup and git rebase --autosquash."

42:29 Woo. The long and the short of it is it talks about doing a lot of the things that Emily said, which was pretty

42:34 cool, but in the end setting up your .gitconfig with autosquash = true, and then adding an alias.

42:42 So you can just type git fixup. And when you type that, it actually does git log and shows

42:47 the last 50 items and then allows you to go back and work with those. And basically it's just a real

42:54 quick way to get back into the scenario where you mark different elements for fix up. So people can

43:00 check that out if they were following Emily's advice, but they want it to be like one line they

43:05 don't have to remember. There you go. That's cool. And then Python 3.10.3 is out as of about a week

43:11 ago, I suppose. So there are many changes amongst here. You know, I would love, there's like so many

43:17 great changes here. I don't know how many do you think that is probably a hundred, maybe a little

43:22 bit less. It would be great if there was like a, these are critically important at the front. Like

43:28 there's a security problem that was fixed, or there's a thing we've taken out is no longer here.

43:33 They're kind of all the same priority. But nonetheless, there's a bunch of changes that

43:37 people can check out and upgrade to the newer version of Python 3.10.

43:41 Different people care about different stuff though. I know. I don't want to impose my importance on

43:47 other people's importance. Yeah. So it's funny, when I first came across Python,

43:52 you were kind of like, why is it so slow between the major versions coming out? But then suddenly,

43:57 as a Python developer, it's like, why are the versions coming out so quickly?

44:00 Yeah. It's definitely true. There's a ton of change. This is just, you know, some minor version

44:08 change that has these, all these changes in here, which is pretty cool.

44:11 Well, we also used to be on an 18 month cycle and now we're on a yearly cycle. So just yeah.

44:16 Yeah. Łukasz Langa's fault that we are 50% faster now. Thanks, Łukasz. All right. How about a joke to

44:23 close out the show? That'd be great.

44:24 Yeah. So here's a good tweet and it's this sort of perplexed, I think in a good way,

44:32 character wearing all these, are these prizes? I don't know. Anyway, Python developers, when someone

44:38 asks what their secret is, and this person just says, I just keep writing pseudocode and it just keeps

44:44 working. It's a little bit like that joke where they have some code, pseudocode in a text file.

44:50 They're like, just rename it to .py and try to run and see what happens. Anyway, that's the joke.

44:55 Nice.

44:56 Thank you, Brian, as always. And Ian, thanks for being part of the show.

44:59 Thank you. Great to have you here.

45:00 Thank you very much both. It's been a real pleasure.

45:02 Yeah, it sure has. See y'all.
