Transcript #276: Tracking cyber intruders with Jupyter and Python
00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
00:04 This is episode 276, recorded March 22nd, 2022. So many twos. I'm Michael Kennedy.
00:13 And I'm Brian Okken.
00:14 And I'm Ian Hellen.
00:15 Hey, Ian. Welcome to the show. It's great to have you here.
00:18 Thank you very much. I've listened to the show a lot and feel very privileged to appear on it.
00:23 It's our privilege to have you here. Thank you so much for listening. And I know you got some
00:29 cool stuff to share. So we're looking forward to hearing about that. Also, I do want to say
00:33 thank you to FusionAuth for sponsoring the show. I'll tell you more about them later.
00:39 Before we get into the topics, Ian, tell people a quick bit about yourself.
00:42 Sure. I'm a developer at Microsoft, in the Microsoft Threat Intelligence Center.
00:48 Been with Microsoft for quite a long time. Only relatively recently, like four years or so ago,
00:52 got into Python coding with Jupyter Notebooks. So I work on Jupyter Notebooks for the Microsoft
00:58 Sentinel project and own a modest open source package called MSTICPy, which we'll
01:05 cover a little bit later. Takes most of my time.
01:07 Fantastic. The whole cybersecurity threat detection stuff, it's very interesting. There's a lot of
01:14 innovation there, but it's also, it's a challenging area to be working.
01:17 Yep. Yep. We're never short of stuff to do.
01:20 Certainly. I'm sure you're not. Well, Brian, how about you kick us off here?
01:25 Well, so I'm going to start off with a problem. So I had a problem and I have a cool solution for it.
01:30 So my problem is on Test & Code, I've got titles and I want to end a show on it. It's an MP3 file,
01:39 but I want to create automated show notes, or not show notes, a transcript.
01:45 So one of the problems, there's a lot of problems in doing this, trying to automate it, but one of them
01:50 is the title. I want to turn that into something that's a little bit, so something like, you know,
01:57 it's got normal English and capitalization and all sorts of spaces and stuff. I want to turn that into
02:03 things that URLs hate. Yeah. I want to turn that into a URL. And, and one of the problem,
02:09 one of the things is getting rid of stop words. So there's a bunch of stuff like lowercasing.
02:13 I can do that easily, but getting rid of stop words was a little hard. So I ran across this
02:19 thing called gensim.parsing.preprocessing. So gensim is a larger sort
02:28 of beast. It's used for machine learning and stuff, to generate models. But
02:36 I'm just really using one little piece of it, the preprocessing part. And it's really pretty
02:41 cool. I actually found this article first. There was an article called "Removing Stop
02:47 Words from Strings in Python." And it has a discussion of NLTK and gensim and spaCy. I tried all
02:55 of them out actually. And the one that really stuck best for me, it talked about using
03:02 remove_stopwords, is exactly what I wanted, right from gensim. So I went ahead and tried that
03:09 and it worked really well, but I'm like, wait, I'm pulling this in from the preprocessing library.
03:15 I wonder what else is in there. And there's all sorts of really cool stuff in here. There's a
03:22 lower_to_unicode. It turns it both into lowercase and into Unicode. That's pretty neat.
03:27 Don't think I need it, but that's neat. But then there was one that, I thought
03:33 maybe this is exactly what I want, something called preprocess_string. And it has a whole bunch
03:38 of filters built into it. Oh, nice. Like strip... Yeah. Strip whitespace, strip punctuation. I love it.
03:45 Yeah. And take away multiple, after it strips punctuation, like you're going to have,
03:49 if I go back, I had a slash in my title for one of the episodes. If it takes that out, I'm going to
03:56 have a space before and a space after. So I want to remove those. So it'll strip multiple whitespace,
04:01 strips out numerics, because I probably don't want numbers in there, and then removes stop words.
04:07 The one thing I don't want, so I'll have to customize how I'm calling this, is stem_text.
04:13 So stem_text, I didn't know what that did without playing with it, but what it does is it
04:18 would take things like "twisted" and turn it into "twist." That's really not right.
04:23 So you definitely don't want that. I don't want that. Don't mess it up, but I think I want
04:26 everything else. So this gensim library, you know, if you're doing machine learning,
04:32 coming up with models, I think this is a great tool to look into. But actually,
04:39 I'm going to use it just to create these titles for, you know, my podcast,
04:45 but I think it feels a little weird. It feels like I'm using this really big hammer to do
04:51 this little tiny problem. I guess I'm okay with it, but you know, do you have any other ideas
04:57 where it could use or, well, I didn't know about this. So I wrote my own. Okay. And it's, it's,
05:04 it's kind of janky. Like it's a little bit, a little bit recursive iterative. It's like,
05:08 we'll take away all the punctuation. Now turn all of your white spaces into single white spaces.
05:14 Cause there might've been, you know, dot space. So now you've got two white spaces, but you've got to
05:18 take away, you know, there's like a bunch of weird steps and then, then put it back. This looks
05:22 cleaner. It is a dependency, but it does look cleaner. I like this. I think it's, I'm glad I
05:27 know about it. Ian, what do you think? Is it a huge thing? I mean, as a dependency. I always
05:32 think of like ML stuff, but this is just the preprocessing, right? Well, I'm actually pulling
05:37 in all of gensim to get this. I don't know if I can pull in little bits, but it's not
05:42 really part of my application that I'm shipping. It's just a tool that I'm using on my laptop. So I,
05:49 I guess downloading it once doesn't really bother me too much, even if it's a big thing.
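As a rough illustration of the hand-rolled steps discussed above (lowercase, strip punctuation, drop numbers, collapse whitespace, remove stop words, and no stemming), here is a minimal standard-library sketch. It is not gensim's implementation: the `title_to_slug` name and the tiny `STOP_WORDS` set are assumptions made up for this example, and gensim ships a much larger stop word list. With gensim itself, `preprocess_string` accepts a `filters` list, so leaving out `stem_text` should be a matter of passing a custom list.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "to", "in", "for", "on"}

# Map every ASCII punctuation character to a space so "a/b" does not fuse into "ab".
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def title_to_slug(title: str) -> str:
    """Turn an episode title into a URL-friendly slug (no stemming)."""
    text = title.lower().translate(PUNCT_TO_SPACE)
    # Drop digits; str.split() below also collapses any run of whitespace.
    text = re.sub(r"\d+", " ", text)
    words = [word for word in text.split() if word not in STOP_WORDS]
    return "-".join(words)


print(title_to_slug("Tracking Cyber Intruders with Jupyter and Python"))
print(title_to_slug("pytest / unittest: 2 ways"))
```

Replacing punctuation with a space rather than deleting it is what creates the double spaces mentioned above, which is why the whitespace collapse has to come afterwards.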
05:53 But cool. Yeah. I was thinking, yeah, that's a good, that's a good point. If it's running
05:56 local, it's like a dev dependency, who cares? Right. It's like worrying about how big pytest
06:00 is. Like it doesn't really matter. And I'm not, well, I kind of care about that. Cause
06:05 CI is going to pull it in all the time for pytest, but. Yeah, but they got fast networks.
06:10 It's not your bandwidth. It'll be all right. One of the things that struck me about this
06:16 that made me think of your situation is like that lower_to_unicode. So many
06:21 times in the security space, it's about like, you're checking for this representation, but what
06:27 if there's another representation that means the same thing? Like you don't say go to this
06:31 directory. You say go dot, dot. And then over there, you know, those, those kinds of non-canonical
06:35 representations. I wonder if there's any use of this kind of stuff for you.
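To make the "non-canonical representation" point concrete, here is a small standard-library sketch that normalizes two values before comparing them. The helper names `canonical_text` and `canonical_path` are invented for this example, and this is only one possible normalization, not something gensim or any security tool is claimed to do.

```python
import posixpath
import unicodedata


def canonical_text(value: str) -> str:
    """Fold compatibility characters and case so look-alike spellings compare equal."""
    return unicodedata.normalize("NFKC", value).casefold()


def canonical_path(path: str) -> str:
    """Collapse '.', '..' and repeated slashes in a POSIX-style path (no filesystem access)."""
    return posixpath.normpath(path)


# "\uff57" is a fullwidth "w", which renders like a plain "w" but is a different code point.
print(canonical_text("Po\uff57erShell") == canonical_text("powershell"))
print(canonical_path("/var/www/html/../../../etc/passwd"))
```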
06:39 Yeah. There's something I kind of touch on in the Pygments section later on, which is that
06:43 attackers typically write scripted attacks and try to obfuscate code using a mix of kind of
06:49 uppercase and putting random dots. I'm just thinking that'd be a nice, potentially a nice
06:53 way of kind of cleaning some of that, that stuff up.
06:56 Yeah, for sure. There was a, there's been some interesting supply chain vulnerability stuff.
07:01 Remember the guy with the colors and I think the faker stuff in JavaScript who
07:06 sabotaged his libraries? There was another one that was maybe well-intentioned. I don't know. It,
07:14 it was some open source library. I don't believe it was Python. I can't remember what it was.
07:19 It could have been, but I'm pretty sure it was in JavaScript because that's where all,
07:22 most of the bad stuff was, it seems. Anyway, they wrote their, they, they taught their dependency
07:28 to erase everybody's hard drive who installed it, who was in Belarus and Russia, which, okay,
07:36 maybe they're trying to contribute, but like it ended up doing a bunch of bad things, even to places
07:40 that were like trying to help say people in the press and journalists do certain things and then like,
07:46 you know, connect with sources, and it erased that database as well. And what they did to make it so
07:51 that nobody would notice in the GitHub commit before it went out to npm was Base64-encode their
07:58 changes. So they basically put in a Base64-encoded string and then decode it and then run that.
08:03 And, you know, it's like that kind of stuff. I know this won't solve that problem, but yeah,
08:07 you know, that, that sort of category of like weird representations.
08:10 Yeah. You need MSTICPy for something like that. It's one of the things we, yeah, it's a common thing,
08:15 kind of Base64 decoding and deobfuscating. But yeah.
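A minimal sketch of that kind of check, using only the standard library: decode a suspicious string so a person can read it, without ever executing it. The `try_decode_base64` helper is hypothetical and is not MSTICPy's API; the sample payload is a harmless string built on the spot.

```python
import base64
import binascii
from typing import Optional


def try_decode_base64(candidate: str) -> Optional[str]:
    """Return the decoded text if `candidate` is valid Base64 holding UTF-8, else None."""
    try:
        raw = base64.b64decode(candidate, validate=True)
        return raw.decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


# Build a harmless sample instead of using a real payload.
hidden = base64.b64encode(b"print('hello from a hidden string')").decode("ascii")
print(hidden)
print(try_decode_base64(hidden))        # decoded for reading, never executed
print(try_decode_base64("not base64!"))
```

Passing `validate=True` makes `b64decode` reject characters outside the Base64 alphabet instead of silently skipping them, which keeps ordinary text from being "decoded" into noise.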
08:19 Yeah. Interesting. yeah, I thought of maybe using something like that with,
08:24 because one of the problems we have is like every, every script is kind of slightly different.
08:28 if you could use something like that, essentially kind of apply like sentiment analysis to
08:34 script. I mean, this is a big problem. It's just not something I've particularly solved.
08:38 but that might be a kind of useful, useful thing to just picking out certain things that indicates
08:43 malicious, like format, you know, format drive.
08:47 Exactly. Yeah. You could certainly represent like this one does hard drive stuff. Is this,
08:52 I thought it was parsing colors. Why is it doing things with the hard drive? This is odd,
08:55 you know, like, or with the network, stuff like that. Cool. All right. Well, you know what you
09:00 would really want to check out if you were trying to research these things, probably documentation.
09:04 So I want to tell you all about DevDocs, devdocs.io. This is pretty cool. Now, when you get there,
09:09 it's interesting: on my Firefox, it's just got like the mobile view, which is really odd. If you go
09:15 there with a full browser, or what it believes is the full browser, I guess it's a slightly
09:20 different view. That's pretty similar, but not the same. So if you open it up, there's a whole bunch of
09:26 programming technologies, let's say not just Python or JavaScript or something, but there's also Vue.js.
09:33 There's Werkzeug, for example, like some of the foundation of Flask, and you can pick the particular
09:37 versions and stuff. So you can go in and enable these different things. So maybe I care about
09:42 Vue. I can go over here and enable that one. We definitely want some Python. Let me go find
09:47 some Python and it gives you all the versions. I'll take that. And let's say I'm also working with
09:52 Postgres. So I'll enable that documentation. And then I might be working with Nginx for the front
09:57 end, which is somewhere right here. So you can go enable that. And then it will be up near the top
10:05 somewhere here. You can see these are either the default ones or the ones that I checked on. So then
10:09 you can open them up and say, I want to go and see the Nginx guide about a debugging log. And then it
10:15 takes you to the documentation for that technology. So it's like a meta documentation repository for all
10:22 of these things all at once, which is pretty cool. Right? So I can go up here and search. I want to
10:26 know about like, let's go about like media tags or something. So you can see the stuff in HTML5. You
10:33 can see the stuff in when you say media, it looks like median. So you can see that in the statistics
10:38 module for Python, some stuff for CSS, or you could come over and say, look, I just want to search for
10:44 CSS. And then you get like using media queries and how to do that kind of stuff. So it's kind of a,
10:50 what you do is you turn on the pieces that are relevant to you, and then you can search across
10:54 those technologies. Cool, right? Wow. Yeah. And then if you're on the move, you can come over
11:01 here and turn on offline data, and it'll download all of that as an app so that when you're at
11:09 the coffee shop or you're on a plane, you now have all the documentation for Python 3.10, Vue.js,
11:13 Werkzeug, Nginx, et cetera, et cetera, that you can use, which is pretty cool. And this is something
11:19 that drives me crazy about Firefox. They had it and they took it away. And I don't understand why,
11:24 because I feel like Firefox is all about the web. So they took away the ability to do progressive
11:29 web apps in Firefox, but all the Chromium browsers support it. So you can actually go and install this
11:35 as a dedicated application on your system. So if you have no web, you just click that open. It's
11:41 its own window. You can, you know, alt-tab or command-tab between it. Super easy. And then turn on the
11:47 offline mode. And you basically have an app that has offline documentation for all the programming
11:52 technologies that you care about. So this is my new coffee shop buddy.
11:56 Does the search go across the things you've selected then? So if I search for like replace or something,
12:01 it's the things I've selected.
12:03 Yeah. So if you turn on like JavaScript and Python, it would look for that in both languages.
12:08 Oh, okay.
12:08 Yeah. So basically the ones you turn on, there's a ton of them, right? And you pick,
12:12 you say, these are interesting to me and then search and stuff from what I can tell only applies
12:16 to the technologies you say you care about. Cause like if, if you don't use Java, you really don't
12:21 want to see the documentation for Java search, right? That would be useless.
12:23 Yeah. One of the things I like about this is it also has versions. So, if you're using a,
12:28 like an older version of Postgres, you can just enable that version.
12:33 Right. Sometimes it doesn't matter very much, but other times it matters massively, like Bootstrap
12:38 3 and Bootstrap 5, they're like fully incompatible basically. Like they're totally
12:42 different keywords and grid systems. And you don't want just the latest. If you've got an old app
12:47 you're working on something like that. Python's more forgiving about that kind of stuff, right?
12:51 It doesn't break as often.
12:52 I was amused that the list though, it has like 3.9, 3.8 for Python
12:59 and it has 3.10 at the bottom because one is obviously...
13:03 Because it's alphabetically sorted. How interesting. Ian, what do you think of this?
13:07 That's very cool. I'm amazed. Is somebody at DevDocs kind of manually maintaining all of the links
13:13 to these, like the original source documentation?
13:17 Yeah. Where are they getting it from? Right. I mean, cause there's, they're super disparate.
13:21 It's like matplotlib and markdown and MariaDB. These are all, it's unlikely they're all stored
13:26 in the same basic system. Right. I don't know how they get them actually.
13:29 Yeah. That's very cool. I mean, I know, I normally have solved the same problem by having like 130
13:34 tabs open to different bits of Python docs and pandas and.
13:38 Exactly. Exactly. Yeah. I'm pretty sure they got pandas in here.
13:42 They got numpy as its own thing that we saw matplotlib. There's pandas and there's even,
13:47 you know, versions of pandas across there.
13:50 Single tab solution. Brilliant.
13:53 Yeah. It looks, looks pretty good to me. All right. You want to tell us about what you got
13:57 for your first item?
13:58 Okay. Sure. Yeah. So, as I mentioned earlier, I own a package called MSTICPy.
14:06 And the first thing to sort out with it is the spelling, because I suffer from this on a daily basis,
14:10 mistyping it, even though I've owned it for like three or four years. So it's MSTIC, standing for
14:17 Microsoft Threat Intelligence Center. There's no Y or anything like that in there. So it's a tool
14:21 set for cybersecurity investigations and hunting in Python, mainly in Jupyter notebooks. So there are a
14:29 couple of questions to ask about that. So firstly, what is cybersecurity hunting and investigation, and
14:35 why are Jupyter notebooks useful? So the first one, cybersec investigation is really responding
14:41 to alerts or other kinds of threat intelligence and trawling through typically large amounts of
14:47 security logs from cloud services, hosts, account services to determine whether this is a real threat
14:53 or not. And there are two main kinds of... That's one of the huge problems, right? Is you've got all
14:57 these different systems. How are you going to know if someone, if you don't have a tool like this,
15:02 how are you going to know that something, someone's in there rooting around, right?
15:06 Yeah. Yeah. And there are a couple of things that usually trigger this kind of search. So one of them
15:11 is an alert maybe coming from your SIEM, and that stands for security information and
15:17 event management. So like a console. ArcSight is a traditional one, or Microsoft Sentinel
15:24 is a cloud-based one. So you get an alert based on a rule and, in a fairly managed process,
15:30 somebody needs to go and investigate. Is this a real threat or is this just noise? Or there might be something
15:35 like SolarWinds, whenever, a year ago, or Log4j, like something in the press or something
15:42 from a threat intel kind of alert that says this kind of threat is around, and that's a more ad hoc process, kind of hunting.
15:49 Like, do we see this in our organization? So that's kind of what MSTICPy is trying to, you know,
15:55 trying to address the needs of. And the second question is why Jupyter notebooks? Why would you do it in a Jupyter
16:00 notebook rather than in your existing SOC tools? I mean, I think this kind of
16:08 activity has a lot in common with big science data, sorry, big data science. I mean, something like
16:15 astronomy, where, you know, hunting for adversary activity is a little bit like trying to find an exoplanet
16:21 in gigabytes of data, or a new quasar or something like that.
16:25 A hundred thousand stars or a hundred thousand lines of log file, and you're hunting for some patterns and stuff.
16:31 Right. And you've got a few photons you're trying to determine, are these kind of different, you know,
16:35 something like adversary activity is a little bit like that. It's like millions and millions of events
16:39 and you're trying to find the bad stuff. So traditional SOC tools, you know, can be really excellent.
16:45 And I work with one that I think is really good, but they all have limitations.
16:50 What's a SOC tool? A SOC tool: SOC is security operations center. So something like, you
16:56 know, a console that fires alerts and tells you that they have a bunch of analysts, engineers looking
17:03 at this output of this and deciding, and that's the trigger for their investigations. They're like,
17:08 Is it like failed logins on the SQL Server?
17:10 Yeah. Something like that. Or, you know, it could be a more sophisticated thing. Like
17:15 something's, you know, tried to access the password data on this, or looks like it's trying
17:20 to access the password data on this host, or has made a weird kind of configuration change to mailbox settings.
17:28 So all those kinds of things can kind of trigger alerts and investigations. but you are limited
17:34 in most kind of operation center environments. Notebooks allow you to kind of break out of some
17:39 of the constraints of that. So firstly, you can get data from anywhere. you're not just limited by
17:45 kind of what's in your logs. You could go to VirusTotal, so you can bring data from anywhere.
17:50 you can use customized kind of analysis. so write your own or get, get things from PyPI. Lots
17:57 of people have kind of written this stuff. you control the workflow. So, so you don't have to follow
18:02 what the tool says. You can reorder things, you can backtrack, redo things, and the workflow is repeatable.
18:08 So if you get a similar kind of, you know, issue again, or similar kind of alert, you can
18:14 fish out an old notebook and rerun the same kind of analysis. And you end up with a nice kind of
18:19 shareable document that, it describes your investigation a bit like the results of a
18:25 scientific investigation. It's like, here are all the steps I took and these are the results.
18:29 And this is what they, this is what we determined to be the bad, you know, the bad activity.
18:33 Right. The other thing that seems useful here is Jupyter. Often the notebooks will save the last
18:40 bit of computed information. And then you can go, you know, change a cell, ask the question again,
18:45 change without rerunning the whole thing. And like that's parsing tons of logs or
18:49 pulling them over SSH or whatever that not doing that again is nice.
18:53 Yeah. And it's brilliant. If you're doing lots of queries in different browser tabs
18:57 and your browser crashes, they've all gone. What do you do?
19:01 It's all in a Jupyter notebook. It's saved, like, second by second after you do it,
19:06 you can just go back and you can go back to things like you may have done months ago.
19:09 So, yeah, absolutely.
19:11 Yeah. So when I started all of this, I kind of thought a lot of this stuff for cyber
19:16 investigations would be available on PyPI. I thought, great, Jupyter notebooks seem brilliant.
19:21 And there's going to be a process tree viewer and there's going to be an event timeline and all this
19:25 kind of stuff. And I found out there wasn't, at least I couldn't find it. So I decided to
19:32 just like stop everything and start writing this stuff. So it turns out that things like
19:37 visualizations you need for detecting exoplanets are a bit different from ones you need to detect
19:42 bad actors. So we started building this thing, originally me, but there's now Pete
19:48 Bryan and Ashwin Patil also working on it, two of my colleagues, and a bunch of people in the
19:55 community. It's got four main functional sections. There's data querying, how you get data in,
20:01 how you do templated queries. There's enrichment. So for example, if you have something like an IP address,
20:07 you might have a bunch of questions about it as an analyst, like which geographical location is this IP
20:13 address from, or are there any malware reports about it? The third area is analysis, things like
20:19 anomaly identification, like the thing you were talking about, a spike in failed logon events,
20:25 an unusual spike in failed logon events, that kind of thing. The final area is visualizations,
20:29 and these are more specialized. I've got a couple of examples in the show notes.
20:34 This is like an anomaly identification pattern. This is one of the custom ones. We use Bokeh,
20:40 which is a really nice visualization package, to allow you to
20:45 view data in a way that an analyst kind of expects to see it. So there are more of this
20:51 kind of visualization than more traditional kind of graphs. I would much rather look at this than
20:56 log files or event logs or, or whatever, you know? Yeah. That's the whole thing about, you know, you,
21:00 you, you need, you may have thousands of events and you need to get down to the few that are the
21:04 interesting, the interesting thing. so one of the areas that we've, we try to focus on
21:10 currently, cause we wrote all this stuff and you have like hundreds of functions that you could use,
21:15 but it's kind of difficult to discover them. And they all, cause they evolved a little bit organically.
21:21 Like how do you, they were working a little bit of a different way, different set of parameters.
21:26 So the work we're currently doing is trying to make this all a bit more accessible. So all of the
21:31 functions that relate to say an IP address, all the questions you want to ask about it are kind
21:36 of dynamically attached to a class called IP address. So they're all like things like,
21:41 Oh, interesting. Do, do, do. So you don't have to work just with a raw string or just some raw IP representation, but you can ask it questions like its location.
21:49 Well, it's not quite that intelligent. It's even a bit less intelligent than Alexa, but, but it's,
21:55 but it's more like, you know, there might be things like geolocation of an IP address,
22:01 threat intel lookups, different queries that might have an IP address as a parameter.
22:07 And previously you'd have to go and find all of these things and import them separately and run
22:12 them. But now they're all dynamically attached as methods. The fact that they use an IP address
22:17 as a parameter means that you just have one object to import, and then you can do all of these different
22:22 operations on this single item. There are some things that don't work with that.
22:27 Some things like the visualizations, for example, they're not IP address or host or account specific.
22:33 They work on big blocks of data. So the other area we're working on is anything that takes a
22:39 bunch of data as an input, we're writing those as pandas accessors. So they appear as methods on a
22:46 DataFrame. So you do kind of DataFrame.mp_plot.timeline, right? And it would produce your
22:53 timeline as long as it's the right kind of data. So yeah, that's one of the challenges of
22:57 writing this kind of thing organically: you end up with a lot of stuff, but nobody knows it's there
23:02 and nobody knows how to import it. So we're trying to make it accessible so that it just becomes a very
23:07 intuitive thing. Oh, I have an IP address. What functions can I do? I can do this, you know,
23:12 it's all like tab completable, that kind of thing.
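The "attach every IP-related function to one class" idea can be sketched in plain Python. This is a toy illustration of the pattern only, not MSTICPy's implementation: `IpEntity`, `attach_to`, and the two example functions are names invented here, and the real package's lookups (geolocation, threat intel) are replaced by trivial checks from the standard `ipaddress` module.

```python
import ipaddress
from typing import Callable


class IpEntity:
    """Hypothetical entity wrapper; not the MSTICPy class."""

    def __init__(self, address: str) -> None:
        self.address = address


def attach_to(entity_cls: type) -> Callable:
    """Decorator that also exposes a plain function as a method on `entity_cls`."""

    def decorator(func: Callable) -> Callable:
        # The function takes the IP string first, so the method passes self.address.
        def method(self, *args, **kwargs):
            return func(self.address, *args, **kwargs)

        method.__name__ = func.__name__
        method.__doc__ = func.__doc__
        setattr(entity_cls, func.__name__, method)
        return func  # the standalone function keeps working as before

    return decorator


@attach_to(IpEntity)
def is_private(ip: str) -> bool:
    """True for private-range addresses such as 10.0.0.0/8."""
    return ipaddress.ip_address(ip).is_private


@attach_to(IpEntity)
def ip_version(ip: str) -> int:
    """Return 4 or 6."""
    return ipaddress.ip_address(ip).version


ip = IpEntity("10.1.2.3")
print(ip.is_private(), ip.ip_version())
# Everything attached shows up on the object, so tab completion can find it.
print([name for name in dir(ip) if not name.startswith("_")])
```

The design point is discoverability: the functions can live anywhere in the package, but a user only imports the entity and sees what is available on it.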
23:14 Yeah, I think it's really cool. You've taken this Python data stack view of cyber security and threat
23:21 detection. Yeah.
23:22 Yeah. Brian, what do you think?
23:23 Well, it's definitely a complicated area. And one of the things I like about this
23:29 story is just talking about the complexities in API design and discoverability. That
23:36 applies to lots of different fields, but yeah.
23:39 Yeah. It's one of those things you should have thought about at the beginning, but
23:42 even at the end, you can tidy things up. Yeah. So. Famous last words.
23:49 So yeah, we're definitely open for like other people collaborating, contributing stuff,
23:55 cause there's a lot of ground to cover.
23:56 Yeah, for sure. It's on GitHub, I saw. One final question before we move on. Is it just for Azure
24:04 or is, is this a thing that more broadly works across different systems?
24:08 No, I think I should have mentioned that a little bit earlier on. We originally built
24:11 it for Microsoft Sentinel notebooks, but it supports like Splunk, Defender,
24:16 and we're working on an Elastic provider. So really anything you can get into a pandas DataFrame,
24:21 you can use most of the functionality. So even if we don't, we don't have a provider ourselves,
24:26 if you've got something like PySpark and you can get a data frame, then all of our functions take
24:32 data frame. You know, we use pandas as our universal data interchange format.
24:37 Yeah, indeed. Indeed. Kim van Wyk out in the audience likes it: it's a much nicer way
24:44 to glean info than logs and complex grep. I'm right there with you. All right. Now, before we
24:49 move on, Brian, let me tell you about our sponsor for this episode. This episode of Python Bytes is
24:55 brought to you by FusionAuth. FusionAuth is an authentication and authorization platform built by
25:01 devs for devs. It solves the problem of building essential user security without adding risk or
25:07 distracting from the primary application. FusionAuth has all the features you need with great support and
25:12 a price that won't break the bank. And you can either self-host it or get the fully managed solution
25:18 hosted in any AWS region. Do you have a side project that needs custom login and registration,
25:23 multi-factor authentication, social logins, or user management? Download FusionAuth Community
25:29 Edition for free. The best part is you get unlimited users and there's no credit card or subscription
25:35 required. Learn more and get started at pythonbytes.fm/fusionauth. The link's in your show notes.
25:41 Thank you to FusionAuth for supporting the show. All right. What do you got for your next one,
25:45 Brian? Number, numbers, something every computer scientist should know?
25:49 Yes. Floating-point arithmetic is complicated. And so when I started working
25:56 professionally, one of the things I was recommended to read was an article called "What Every
26:01 Computer Scientist Should Know About Floating-Point Arithmetic." And don't worry, it's only like a
26:06 really long paper with lots of math. So I am not telling you to read this, although it is an
26:13 interesting read. What I would like you to read is this article by David Amos called "The Right Way to
26:19 Compare Floats in Python," because there's a few things that we need to know about floats when we're
26:24 using them. And he covers all of this in the article without going through
26:30 tons of scary math. Floating-point numbers have to be represented in a way that the computer
26:37 can store them and use them and manipulate them, even though some numbers are huge and won't fit
26:43 normally. So we have to do things like accept that there's error and rounding. So there's a little bit
26:49 of a discussion there that he talks about. One of the things that surprises people sometimes when they
26:54 first come into Python, but it's not just Python, it's most languages, is somewhere
27:00 there's going to be something obvious that doesn't work, like in David's example, 0.1
27:06 + 0.2 == 0.3. And that will show up as False because they're not equal.
27:14 And this is weird. It obviously seems crazy that that doesn't work, but it's not just equals.
27:21 You can also do comparisons like, you know, less than or greater than. So not only
27:27 are they not equal, 0.1 plus 0.2 is not even less than or equal to 0.3. It's weird.
27:34 So what do you do? The gist of it is don't compare things with the normal
27:42 math comparisons if there's floating point involved. So what you want to do instead is, and
27:49 here's a little tiny bit of math, way less than the thesis,
27:54 the dissertation. Yeah. So there's a whole bunch of stuff built into Python
27:59 to work with comparisons. And one of the most common ones, I'm trying to get there,
28:05 is math.isclose. So there's a math library with an isclose function
28:11 that's used to just say, hey, I've got two values. Are these close enough? And
28:18 if you have to compare floats, something like this is great. And
28:25 behind the scenes, what it does is it's taking the two values and
28:29 subtracting them and figuring out if the delta, or the absolute value of the delta, is below some
28:36 tolerance, some reasonable tolerance, like close enough. And what that tolerance is,
28:41 is either a relative or absolute tolerance. And most of the time you can kind of get away
28:47 with not caring about that, but if you do care about it, you can control that. You can pass in
28:52 what tolerance you expect things to be within. I use stuff like this all the time with
28:57 test equipment, because I definitely want control over the tolerance levels.
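A short sketch of `math.isclose` and its two tolerances. The helper `is_close_sketch` is a hypothetical name that mirrors the formula given in the Python documentation, and the voltage reading is made-up illustration data, not something from the episode.

```python
import math

a = 0.1 + 0.2
b = 0.3

# Defaults: relative tolerance 1e-09, absolute tolerance 0.0.
close_default = math.isclose(a, b)  # True

def is_close_sketch(x, y, rel_tol=1e-09, abs_tol=0.0):
    """Hypothetical helper: the documented comparison, written out by hand."""
    return abs(x - y) <= max(rel_tol * max(abs(x), abs(y)), abs_tol)

same_answer = is_close_sketch(a, b) == close_default  # True

# A relative tolerance is of no help when comparing against zero,
# so pass an absolute tolerance in that case.
near_zero_default = math.isclose(1e-12, 0.0)            # False
near_zero_abs = math.isclose(1e-12, 0.0, abs_tol=1e-9)  # True

# Made-up instrument reading checked against a 5.0 V target within 1%.
reading_ok = math.isclose(4.97, 5.0, rel_tol=0.01)      # True
```

Controlling `rel_tol` and `abs_tol` explicitly is the part that matters for test equipment, where the acceptable error is usually part of the spec.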
29:02 So, yeah, for sure. So there's math.isclose, but then there's also, I'm not going to
29:08 scroll all the way down here, but he also covers NumPy. So NumPy has got a
29:13 couple of these that are really great. One of them is isclose also, but it works on arrays and
29:19 it'll give you an array of True and False values. But you can also use allclose,
29:25 which just says you've got two arrays, and if all of the pairs are close enough,
29:30 it returns True. Also covered, which we use during testing a lot, is pytest.approx,
29:37 which is a little bit of a different beast, but David covers that. So basically this
29:43 is a semi-regular reminder to anybody using floating point math in Python that you should be careful
29:49 with it, or any other language. So. Yeah. It's not a Python thing. It's just representing
29:55 things that don't fit. Now there's some things sometimes where you have to be very exact.
29:59 You need to be very precise. And in those cases, Python does have the Decimal and Fraction types.
30:05 And David covers these in the article, which are cool. They're cool things to know about,
30:10 like definitely around people using money or other very high precision. But
30:17 so those are covered. You do take some sort of a hit for those. But if you really care about
30:23 the precision and want to do things exactly right, then you probably should read that
30:29 larger article, because there's things that you have to do, like certain operations before
30:34 other operations, to try to keep the error from accumulating too high. So it gets messy.
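A small sketch of the exact types mentioned above, using the standard library only. The `math.fsum` part is swapped in as one standard-library way to limit accumulated error. It is an assumption that this is relevant here, and it is not necessarily the technique the larger article describes.

```python
import math
from decimal import Decimal
from fractions import Fraction

# Exact decimal and rational arithmetic: the comparison that fails for floats holds here.
decimal_exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")       # True
fraction_exact = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)   # True

# Pitfall: building a Decimal from a float carries the float's error along.
decimal_from_float_matches = Decimal(0.1) == Decimal("0.1")             # False

# Adding floats one at a time lets the rounding error accumulate.
values = [0.1] * 10
running = 0.0
for value in values:
    running += value

careful_total = math.fsum(values)  # tracks the lost low-order bits

print(running)        # 0.9999999999999999
print(careful_total)  # 1.0
```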
30:39 Interesting.
30:40 I think I'm fundamentally disturbed by the idea that zero isn't zero. So my approach to floating
30:45 point numbers is normally convert them to ints. Yeah. I was thinking that, yeah, sometimes that is
30:53 the way to do it. Right. I was thinking this kind of stuff maybe applies a lot to the project that
30:59 you're working on. If you're trying to come up with ratios that represent, you know, how risky something
31:05 is and things like that. Yeah. Yeah. Yeah. I mean, certainly a lot of, yeah, I was being a bit
31:10 flippant before. It's just fun. It's like, I'm very Platonic at heart, I think. So, like, zero and
31:17 one should be zero and one, not nearly one or nearly zero. There should be a perfect square and a perfect
31:22 circle. Like how can they not exist in our language? Is it really zero or negative zero?
31:27 Henry in the audience. Henry also points out that pytest.approx also works on NumPy arrays as
31:37 well. Nice.
31:38 Which is pretty cool.
31:39 Cool.
31:39 Cool.
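A brief sketch of the NumPy helpers and `pytest.approx` discussed above, including Henry's point about arrays. It assumes `numpy` and `pytest` are installed, and the array values are illustrative.

```python
import numpy as np
import pytest

measured = np.array([0.1 + 0.2, 0.2 + 0.4, 1.0])
expected = np.array([0.3, 0.6, 1.1])

# Element-by-element answer: one boolean per pair.
elementwise = np.isclose(measured, expected)

# Single answer: True only if every pair is close enough.
everything_close = np.allclose(measured, expected)
first_two_close = np.allclose(measured[:2], expected[:2])

# pytest.approx works for plain floats and for NumPy arrays.
approx_scalar = (0.1 + 0.2) == pytest.approx(0.3)
approx_array = bool(measured[:2] == pytest.approx(expected[:2]))

print(elementwise.tolist())  # [True, True, False]
```

In a test, the last comparison would normally sit directly inside an `assert` statement.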
31:39 You can put that all together. All right. Let me tell you all about pypyr. I think
31:45 "piper" might be the way you pronounce it. Everything needs its own description,
31:49 its own like little phonetic bit. So this is a simple way to create scripts that run and do
31:57 stuff on your computer using Python. And what's cool about it is it has a real simple way to define
32:02 the steps. Some of those steps can be optional, but then you can also piece together things like
32:07 other programming. So you can combine commands, different scripts in different languages and
32:13 applications all into one sequence of events that happens on your computer. So it's basically a task
32:20 runner where you define stuff in YAML. And probably the best way to see is to go check out the docs. And
32:25 there's a whole bunch of docs. The docs are really nice here actually. So for example,
32:30 if you go to getting started and come down here and run your first pipeline, I really like the way
32:35 the docs here look, but the way you define it, here's like a one-step one: you
32:41 just say the steps, and it's all YAML, and give a step a name so you can refer to it. And then you have inputs
32:47 and outputs, and you do the little curly-brace string interpolation types of things. Or you can
32:51 have more complex ones like with different steps and you can even have little comments. There's a way to
32:57 put a comment in your YAML file as well. So there's also conditional. Let's see if I can find a good
33:03 conditional one down here. Here's one that works with, like, this one is just an echo
33:09 statement and the ping command, but you know, whatever, whatever you want to do, you can basically
33:14 pass command line arguments to the YAML file or to the workflow, the pipeline, and it'll take those and
33:21 feed them into the steps. So for example, when you call it, you can say like count equals one and IP
33:27 equals that. And those become the little string interpolated pieces that go in there. So you can
33:32 just combine whatever, basically whatever commands are available to the shell, right? Be that Python or
33:38 POSIX or Windows or PowerShell or whatever you're looking to do. Pretty cool, huh?
33:42 Hmm. That's pretty neat. I might need this for my job of automating my show notes.
33:49 I might use some of this.
33:50 Oh yeah, there you go. If you can find this, go do that, and so on. Like, here's one
33:55 that sort of uses the truthiness. So it says there's a bunch of different steps, and
33:59 you can use the run flag. So here it says run if there's a value for A on this one. And this one
34:06 says run if there's a value for B. And then there's an example where it says, okay, we run it by itself.
34:10 Those don't run. But if you pass A, then it runs the A step. If you pass B, it does the B step,
34:15 or it can do both if you pass them both. And I like the simplicity of it. A lot of tools
34:20 like this feel like they're pretty complicated.
34:22 You know, it's sort of like your example with gensim, Brian, where you're like,
34:26 is this thing too heavyweight for what I'm trying to ask it to do? You know? And this seems like a
34:31 real simple thing. And I don't have to learn about make or any of those kinds of things.
34:34 Yeah. GitHub Actions or, yeah. Yeah. Yeah. It's got a bit of a GitHub Actions feel to it,
34:40 but it seems like a nicer kind of declarative. That's really cool.
34:45 Indeed. Yeah. If you're not into programming, or you didn't want your steps to be programming.
34:49 But of course at each step you could call a Python app or script that's going to do
34:55 something complicated, right, if it needs to. But the orchestration of that
34:59 you don't have to make complicated. Is it just a command line tool? Or can you invoke it from Python?
35:03 Might be a bit interesting. I'm sure there's a way to import it and make it do a thing. You
35:09 know, it's probably just a Python package with an entry point in this package. So I would think so.
35:14 Yeah. Cause it would be nice to be able to do that rather than just using subprocess to invoke a lot
35:18 of things. Like if you're in. Oh, interesting. I hadn't really thought about it as a replacement for
35:23 subprocess, but yeah, because a lot of times when you're trying to orchestrate stuff, like it talks
35:28 about here being part of the shell or being another app or another language, you would just use subprocess
35:34 on it. Right. Yeah. Cool. Well, there it is. pypyr, pypyr.io, and people can check that out. It looks
35:40 pretty interesting. Nice. All right. Ian, you want to take us out with your final item here?
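For comparison, a minimal sketch of the hand-rolled `subprocess` orchestration that the pypyr discussion above contrasts with a declarative pipeline. The two steps are hypothetical placeholders that just print text, and none of this is pypyr's own API.

```python
import subprocess
import sys

# Hypothetical two-step "pipeline": each step is simply a command line.
steps = [
    [sys.executable, "-c", "print('step one')"],
    [sys.executable, "-c", "print('step two')"],
]

outputs = []
for command in steps:
    # check=True stops the sequence as soon as a step exits with an error.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    outputs.append(result.stdout.strip())

print(outputs)  # ['step one', 'step two']
```

Conditions, arguments, and retries all have to be coded by hand in this style, which is the bookkeeping a task runner takes over.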
35:44 Ah, Pygments. Okay. So this is a package. I mean, if you're a developer, there's a very good chance
35:49 that you have been using this for years, like me, without knowing about it. You might have
35:54 seen it being installed as a dependency. It's like, what is that thing? That was my thought,
35:58 Ian. I'm like, I know I see this all the time in my dependencies and I just never really bothered to
36:03 look into what it does. Yeah. So I hadn't until recently. So if you use Jupyter Notebook
36:09 Markdown, you know, you can put like three backticks and then a block of code. And you can actually
36:16 put like python or bash or something, and it will intelligently highlight it. So the thing that's
36:23 doing that intelligent highlighting is Pygments. GitHub Markdown, same kind of thing, although I'm not
36:28 sure whether GitHub uses Pygments. And if you do developer docs, like Read the Docs and Sphinx,
36:34 that also uses Pygments to kind of color code your code samples. And I know there's a lot of,
36:41 you know, writing kind of blog posts and stuff like that. There are
36:45 quite a few services out there where you can take a chunk of code and it will intelligently
36:50 highlight it and give you a JPEG or a PNG back. And that's kind of nice, but then you can't copy
36:56 and paste the code from those samples. So I don't like that really. I think if you're going to put
37:00 code in an article, you probably intend for people to be able to copy and paste it.
37:05 Yeah. That's the most likely thing you are to copy and paste.
37:08 Yeah.
37:08 Yeah. Right. Cause you want that code over here.
37:10 Yeah. You don't want an image of your, I mean, cause you could use OCR to like reinterpret it,
37:13 but it's all, yeah. And then maybe Brian's gensim to, like, tidy it up.
37:19 But so with Pygments, you can use it as a standalone package and it can do this kind of
37:27 rendering, and it can render to HTML with CSS style sheets for all of the color coding. It also
37:33 renders to ANSI terminal, LaTeX, a few other kinds of things. So if you're,
37:40 you know, if you want to get a nicely formatted piece of code in a document or you're doing
37:45 developer docs, it's certainly kind of useful. I mean, I came across it, or, I should just say
37:50 one thing: it also supports, maybe I can just switch, it supports lots and lots of languages. So it's
37:55 very simple to use. It has a highlight function, and then you import a lexer, which is the
38:02 thing that understands the tokens in a language, and a formatter for the output type you want.
38:08 And I think there's hundreds of these things. So there are a lot of languages in there.
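A minimal sketch of the highlight, lexer, and formatter trio described above, assuming the `pygments` package is installed. The sample code string is made up for illustration.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter, TerminalFormatter
from pygments.lexers import PythonLexer, get_lexer_by_name

code = 'def greet(name):\n    return f"Hello, {name}"\n'

# HTML output: markup with CSS classes, plus the matching style sheet.
html = highlight(code, PythonLexer(), HtmlFormatter())
css = HtmlFormatter().get_style_defs(".highlight")

# ANSI terminal output: the same tokens rendered as escape sequences.
ansi = highlight(code, PythonLexer(), TerminalFormatter())

# Lexers can also be looked up by name, including data formats such as JSON.
json_html = highlight('{"answer": 42}', get_lexer_by_name("json"), HtmlFormatter())
```

Because the HTML output is real text rather than an image, readers can still copy and paste the code from it.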
38:12 No kidding.
38:13 More than half of these I've never heard of. And it also supports, as well as things like,
38:17 you know, you'd expect, Python, it supports Python tracebacks. So it has a separate lexer for color
38:22 coding tracebacks. All the usual languages you'd expect, but also some things like data formats,
38:28 like TOML, JSON, XML. Okay. Interesting. Like a lot of the files that we might run across.
38:36 Yeah.
38:37 Yeah.
38:37 Yeah. And so it's very, very easy to use. And the reason I came across it recently is because
38:44 a lot of attacker code tends to be deliberately obfuscated. So it's kind of Base64
38:51 encoded, but then even once you decode it, it's kind of munged in a way to make it as unreadable as
38:57 possible. So one of the things that we try to do is pull that code back, like decode it, try to
39:03 clean it, deobfuscate it. But if you can present it as close to the way a
39:09 developer would write it as possible, it makes it much quicker for an analyst to determine
39:14 what is this doing? So we've used it now in MSTICPy to kind of color display things like,
39:21 well, it's just PowerShell script or Bash or something like that. So that's how I came across it.
39:26 Actually, rather than just seeing it go past as part of a pip install, I actually had to invoke it
39:32 directly. So a kind of big shout out to the developers and maintainers of Pygments.
39:38 It's one of those packages that probably millions of people benefit from, but very few people
39:43 kind of know about it. And it's just super easy to use. They seem to be adding
39:48 lexers all the time. So, great. Yeah, this is amazing. I didn't realize that it did all
39:54 of this. This is way more advanced than I thought. Brian, did you know? No, I just thought it was
40:00 something that magically did syntax highlighting. So I didn't have to care about it.
40:04 Yeah, exactly. I got a little example in the show notes as well. I posted it has a dark theme.
40:13 Yeah. Yeah. And you probably want to include this nobackground equals True
40:18 if using Jupyter Notebooks. Cause if you select a theme, it just flips the whole notebook's kind of
40:23 CSS theme. So that tells it just not to mess with what's in the background. Okay. Yeah,
40:29 that looks great. Yeah. Thanks. Thanks for pointing out how useful that can be. That's cool.
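A hedged sketch of the `nobackground` option mentioned above, assuming `pygments` is installed. The PowerShell one-liner, the `monokai` style, and the use of `noclasses=True` (inline styles) are illustrative choices, not the example from the show notes.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import get_lexer_by_name

# Made-up script text standing in for decoded attacker code.
script = "Get-Process | Where-Object { $_.CPU -gt 100 }\n"
lexer = get_lexer_by_name("powershell")

# Dark theme with inline styles: the wrapping <div> carries the theme's background.
themed = highlight(script, lexer, HtmlFormatter(style="monokai", noclasses=True))

# Same theme, but leave the background alone so the notebook's own colors stay put.
no_background = highlight(
    script, lexer, HtmlFormatter(style="monokai", noclasses=True, nobackground=True)
)

themed_opening_tag = themed.split(">", 1)[0]
plain_opening_tag = no_background.split(">", 1)[0]
```

In a notebook, the resulting string could be rendered with `IPython.display.HTML(no_background)`.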
40:34 Like I said, I've seen it go by all the time. I just never really paid that much attention to it.
40:39 It's probably a pretty minority use, but like if you need it, it's great.
40:42 Yeah. It's incredibly powerful. Fantastic. Well, that's all of our main items. Brian,
40:46 you got any extras? Just one extra, actually. One of the things when I was doing
40:51 the first topic with gensim, it doesn't have very many dependencies,
40:57 but one of the dependencies is this library called smart_open. And I'm like, what? I
41:03 open things and I want to be smart about it. So I wanted to check this out and it's pretty neat.
41:09 I don't know if we've covered this before, but it basically mimics the interface of
41:15 normal Python open, but you can pass it really anything. And it does, like,
41:22 transparent on-the-fly reading of things, efficient streaming of large files from like S3 or Azure
41:29 or over the web.
41:31 Even straight just HTTP. Yeah. If you just have a link to a large file on a web server.
41:35 Yeah. And then the code for it is just super nice. You know, you import open
41:41 from smart_open and you've got, like, for line in open this thing, and you can just work with each
41:49 line there. It's pretty cool.
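A small sketch of the `for line in open(...)` pattern described above, assuming the third-party `smart_open` package is installed. To keep it runnable offline it reads a local gzip file created on the spot. An `s3://` or `https://` URL would go through the same call but needs network access and, for S3, credentials.

```python
import gzip
import os
import tempfile

from smart_open import open  # drop-in replacement for the built-in open

# Create a small compressed file to read back (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "example.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as fout:
    fout.write("first line\nsecond line\n")

# smart_open infers gzip from the .gz extension and decompresses on the fly.
lines = []
with open(path, "r", encoding="utf-8") as fin:
    for line in fin:
        lines.append(line.rstrip("\n"))

print(lines)  # ['first line', 'second line']
```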
41:51 I love it. That's a, that's a great one. Very nice. Ian, you got any extras you want to
41:56 shout out while we're here? I don't, I'm afraid.
41:59 I have two real quick ones to just quickly talk about. Last time,
42:07 Emily Morehouse spoke about using autosquash, which was really cool. So Adam,
42:14 let me get the attribution correct here, Adam Parkin sent in a follow-up to say,
42:20 hey, you should check out this article over here called Fixing commits with
42:25 git commit --fixup and git rebase --autosquash.
42:29 Woo. The long and the short of it is it talks about doing a lot of things that Emily said were pretty
42:34 cool, but in the end setting up your .gitconfig with autosquash equals true, and then adding an alias
42:42 so you can just type git space fixup. And when you type that, it actually does git log and shows
42:47 the last 50 items and then allows you to go back and work with those. And basically it's just a real
42:54 quick way to get back into the scenario where you mark different elements for fixup. So people can
43:00 check that out if they were following Emily's advice, but they want it to be like one line they
43:05 don't have to remember. There you go. That's cool. And then Python 3.10.3 is out as of about a week
43:11 ago, I suppose. So there are many changes amongst here. You know, I would love, there's like so many
43:17 great changes here. I don't know how many do you think that is probably a hundred, maybe a little
43:22 bit less. It would be great if there was like a, these are critically important at the front. Like
43:28 there's a security problem that was fixed, or there's a thing we've taken out is no longer here.
43:33 They're kind of all the same priority. But nonetheless, there's a bunch of changes that
43:37 people can check out and upgrade to the newer version of Python 3.10.
43:41 Different people care about different stuff though. I know. I don't want to impose my importance on
43:47 other people's importance. Yeah. So it's funny, when I first came across Python,
43:52 you were kind of like, why is it so slow between the major versions coming out? But then suddenly,
43:57 as a Python developer, it's like, why are the versions coming out so quickly?
44:00 Yeah. It's definitely true. There's a ton of change. This is just, you know, some minor version
44:08 change that has these, all these changes in here, which is pretty cool.
44:11 Well, we also used to be on an 18 month cycle and now we're on a yearly cycle. So just yeah.
44:16 Yeah. It's Łukasz Langa's fault that we are 50% faster now. Thanks, Łukasz. All right. How about a joke to
44:23 close out the show? That'd be great.
44:24 Yeah. So here's a good tweet and it's this sort of perplexed, I think in a good way,
44:32 character wearing all these, are these prizes? I don't know. Anyway, Python developers, when someone
44:38 asks what their secret is, and this person just says, I just keep writing pseudocode and it just keeps
44:44 working. It's a little bit like that joke where they have some code, pseudocode in a text file.
44:50 They're like, just rename it to .py and try to run and see what happens. Anyway, that's the joke.
44:55 Nice.
44:56 Thank you, Brian, as always. And Ian, thanks for being part of the show.
44:59 Thank you. Great to have you here.
45:00 Thank you very much both. It's been a real pleasure.
45:02 Yeah, it sure has. See y'all.