Transcript #276: Tracking cyber intruders with Jupyter and Python
00:00 Hello, and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is Episode 276, recorded March 22, 2022. So many twos. I'm Michael Kennedy,
00:13 and I'm Brian Okken,
00:14 and I'm Ian Hellen. Hey, Ian,
00:15 welcome to the show. It's great to have you here.
00:18 Thank you very much. I've listened to the show a lot. And I feel very privileged to appear on it.
00:24 It's our privilege to have you here. Thank you so much for listening. And I know you've got some cool stuff to share, so we're looking forward to hearing about that. Also, I do want to say thank you to FusionAuth for sponsoring the show. We'll tell you more about them later. Before we get into topics, Ian, tell people a quick bit about yourself.
00:42 Sure. I'm a developer at Microsoft, in the Microsoft Threat Intelligence Centre. I've been at Microsoft quite a long time, but only relatively recently, like four years or so ago, got into Python coding with Jupyter notebooks. So I work on Jupyter notebooks for the Microsoft Sentinel project, and own a modest open source package called MSTICPy, which I'll cover a little bit later. That takes most of my time.
01:07 Fantastic. The whole cybersecurity threat detection stuff, it's very interesting. There's a lot of innovation there. But it's also a challenging area to be working in.
01:17 Yep. Yep. We're never short of stuff to do.
01:21 I'm sure you're not. Well, Brian, how about you kick us off here?
01:25 Well, so I'm going to start off with a problem. So I had a problem, and I have a cool solution for it. So my problem is, on Test & Code, I've got a title for each show and its mp3 file, but I want to create automated show notes, or not show notes, transcripts. There's a lot of problems in trying to automate this, but one of them is the title. It's got normal English and capitalization and all sorts of spaces and stuff, things that URLs hate. Yeah, I want to turn that into a URL. And one of the things is getting rid of stop words. So there's a bunch of stuff like lowercasing, I can do that easily. But getting rid of stop words was a little hard. So I ran across this thing called gensim.parsing.preprocessing. Gensim is a larger sort of beast. It's used for machine learning and stuff, to generate models. But I'm just really using one little piece of it, the preprocessing part. And it's really pretty cool. I actually found this article first, an article called Removing Stop Words from Strings in Python. And it has a discussion of NLTK, and gensim, and spaCy. I tried all of them out, actually. And the one that really stuck best for me is remove_stopwords from gensim, which is exactly what I wanted. So I went ahead and tried that and it worked really well. But I'm like, wait, I'm pulling this in from the preprocessing library. I wonder what else is in there. And there's all sorts of really cool stuff in here. There's lower_to_unicode, which turns it both into lowercase and into Unicode. That's pretty neat. Don't think I need it, but that's neat.
But then there was one that I thought maybe is exactly what I want, something called preprocess_string. And it has a whole bunch of filters built into it. Oh, nice.
03:41 Like strip strip. Yeah, strip whitespace strip punctuation. I
03:44 love it. Yeah, and take away multiple... After it strips punctuation, like, I had a slash in my title for one of the episodes. If it takes that out, I'm going to have a space before and a space after. So I want to remove those. So it'll strip multiple whitespaces, strip out numerics, because I probably don't want numbers in there, and then remove stop words. The one thing I don't want, so I'll have to customise how I'm calling this, is stem_text. I didn't know what that did without playing with it. But what it does is it will take things like twisted and turning and turn it into twist. That's really not right. So
04:23 you definitely don't want that. I
04:24 don't want that. I don't want to mess it up. But I think I want everything else. So this gensim library, you know, if you're doing machine learning or coming up with models, I think this is a great tool to look into. But I'm actually going to use it just for removing stop words to create these titles for, you know, my podcast. I think it feels a little weird. It feels like I'm using this really big hammer to do this little tiny problem. I guess I'm okay with it. But, you know, do you have any other ideas we could use here? Well,
04:59 I didn't know about this, so I wrote my own, okay. And it's kind of janky, like it's a little bit recursive, iterative. It's like, we'll take away all the punctuation, now turn all of your whitespace into single whitespace, because there might have been, you know, dot space, so now you've got two whitespaces. There's like a bunch of weird steps and then put it back. This looks cleaner. It is a dependency, but it does look cleaner. I like this. I'm glad to know about it. Ian, what do you think?
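Editor's sketch: the hand-rolled approach described above (lowercase, strip punctuation, collapse whitespace, drop stop words) might look roughly like the following, using only the standard library. The function name `slugify` and the tiny `STOP_WORDS` set are hypothetical, chosen for illustration; this is not the code either host actually wrote, and it is not gensim's implementation.

```python
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "for", "to", "in", "on"}

# Map every ASCII punctuation character to a space, so "a/b" becomes "a b"
# rather than "ab".
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def slugify(title):
    """Turn an episode title into a URL-friendly slug."""
    text = title.lower().translate(PUNCTUATION_TO_SPACE)
    # split() with no arguments collapses runs of whitespace, which covers
    # the "space before and space after" left behind by removed punctuation.
    words = [word for word in text.split() if word not in STOP_WORDS]
    return "-".join(words)


print(slugify("Tracking cyber intruders with Jupyter and Python"))
```

Unlike the default gensim pipeline discussed above, this sketch does no stemming and keeps digits, which matches what Brian says he wants for titles.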
05:28 you see a huge thing. I mean, dependency, but I always think of like ML, like stuff. But this is like just the pre processing, right? So
05:36 I'm actually pulling in all of gensim to get this. I don't know if I can pull in just a little bit. But it's not really part of my application that I'm shipping, it's just a tool that I'm using on my laptop. So I guess downloading it once doesn't really bother me too much, even if it's a big thing. But cool.
05:53 Yeah, I was thinking, Yeah, that's
05:54 a good, that's a good point. If it's running local, it's like a dev dependency. Who cares? Right? It's like worrying about how big pytest is, like, it doesn't really matter.
06:02 Well, I kind of care about that, because CI is going to pull it in all the time for pytest, but,
06:08 but they've got fast networks. That's not your bad boy. That'll be all right. One of the things that struck me about this, that made me think of your situation, is that lower_to_unicode. So many times in the security space, it's about, like, you're checking for this representation, but what if there's another representation that means the same thing? Like, you don't say go to this directory, you say go dot dot and then over there. You know, those kinds of non-canonical representations. I wonder if there's any use of this kind of stuff for you?
06:39 Yeah, there's something I kind of touch on in the Pygments section later on, which is that attackers typically write scripted attacks and try to obfuscate code using a mix of uppercase and putting in random dots. I'm just looking for a nice, potentially nice way of kind of cleaning some of that stuff up.
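Editor's sketch: the "non-canonical representation" problem mentioned above can be illustrated with two standard-library helpers. The function names `canonical_path` and `canonical_text` are made up for this example, and real deobfuscation needs far more than this; treat it as a minimal illustration of normalising before comparing.

```python
import posixpath
import unicodedata


def canonical_path(path):
    """Collapse '.', '..' and repeated slashes in a POSIX-style path."""
    return posixpath.normpath(path)


def canonical_text(text):
    """Fold Unicode compatibility forms and case so look-alikes compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()


# "Go dot dot and then over there" resolves to the same place as the direct path.
print(canonical_path("/var/www/../../etc//passwd"))

# A fullwidth "P" (U+FF30) plus mixed case still matches the plain keyword.
print(canonical_text("\uff30owerShell") == canonical_text("powershell"))
```

The design point is to compare canonical forms rather than raw input, so a check for one spelling also catches the others.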
06:56 Yeah, for sure. There's been some interesting supply chain vulnerability stuff. Remember the guy with colors, and I think the faker stuff in JavaScript, who sabotaged his libraries? There was another one that was maybe well intentioned, I don't know, some open source library. I don't believe it was Python. I can't remember what it was. I'm pretty sure it was in JavaScript, because that's where most of the bad stuff is, it seems. Anyway, they taught their dependency to erase the hard drive of everybody who installed it who was in Belarus and Russia, which, okay, maybe they're trying to control, but it ended up doing a bunch of bad things, even to places that were trying to help, say, people in the press and journalists do certain things, like, you know, connect with sources, and it erased that database as well. And what they did to make it so that nobody would notice in the GitHub commit before it went out to npm was base64 encode their changes. So basically put in a base64 encoded string and then, like, decode and then run that. And yeah, it's that kind of stuff. I know this won't solve that problem. But yeah, that's the sort of category of weird representations. You
08:10 need MSTICPy for something like that. It's one of the common things we do, kind of base64 decoding and deobfuscating.
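Editor's sketch: to make the base64 trick above concrete, here is a minimal standard-library example that looks for base64-looking strings in source text and decodes them for inspection, without executing anything. It is not MSTICPy's implementation; the function name, the regular expression, and the 16-character threshold are all assumptions chosen for illustration.

```python
import base64
import binascii
import re

# Runs of 16 or more base64 characters with optional padding.
# The length threshold is an arbitrary assumption to cut down on noise.
B64_PATTERN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def find_base64_strings(source):
    """Return the decoded text of every valid base64 candidate in source."""
    decoded = []
    for match in B64_PATTERN.finditer(source):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64, or not text: skip it.
            continue
    return decoded


# Build a harmless sample: a "script" that hides a call inside a base64 string.
hidden = base64.b64encode(b"erase_disk('/home')").decode("ascii")
script = 'run(decode("' + hidden + '"))'
print(find_base64_strings(script))
```

A reviewer reading the commit would only see the opaque string; decoding candidates like this surfaces what would actually run.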
08:18 Yeah, interesting.
08:21 Yeah, I thought maybe using something like that, because one of the problems we have is, like, every script is kind of slightly different. And if you could use something like that to essentially apply, like, sentiment analysis to scripts... This is a big problem. It's not something I've particularly solved. But there might be a useful thing of just picking out certain things that indicate malicious intent, like format my drive.
08:47 Exactly. Yeah, you can certainly represent, like, this one does hard drive stuff. I thought it was parsing colours, why is it doing things with the hard drive? This is odd, you know. Or with the network, stuff like that. Cool. All right. Well, you know what you would really want to check out if you were trying to research these things? Probably documentation. So I want to tell you all about DevDocs, devdocs.io. This is pretty cool. Now, when you get there, it's interesting: on my Firefox, it's just got like the mobile view, which is really odd. If you go there with what it believes is the full browser, I guess it's a slightly different view that's pretty similar, but not the same. So you open up a whole bunch of programming technologies, let's say, not just Python or JavaScript or something, but there's also Vue.js, there's Werkzeug, for example, like some of the foundation of Flask, and you can pick the particular versions and stuff. You can go in and enable these different things. So maybe I care about Vue, I can go over here and enable that one. We definitely want some Python. Let me go find some Python; it gives you all the versions on that. And let's say I'm also working with Postgres, so I'll enable that documentation. And then I might be working with nginx for the front end, which is somewhere here, so I'll go enable that. And then up near the top, you can see these are either the default ones or the ones that I checked on. So then you can open them up and say, I want to go and see the nginx guide about a debugging log. And then it takes you to the documentation for that technology. So it's like a meta documentation repository for all of these things all at once, which is pretty cool, right? So I can go up here and search. Let's search for, like, media tags or something. So you can see the stuff in HTML5, you can see the stuff.
And when you say media, it also matches median, so you can see that in the statistics module for Python, and some stuff for CSS. Or you could say, I just want to search for CSS, and then you get, like, using media queries and how to do that kind of stuff. So what you do is you turn on the pieces that are relevant to you, and then you can search across those technologies. Cool, right? Well, yeah, and then, if you're on the move, you can come over here and turn on offline data. And it'll download all of that as an app, so that when you're at the coffee shop, or on a plane, you now have all the documentation for Python 3.10, Vue.js, very exciting, nginx, etc., that you can use, which is pretty cool. And this is something that drives me crazy about Firefox: they had it and they took it away. And I don't understand why, because Firefox is all about the web. So they took away the ability to do progressive web apps in Firefox, but all the Chromium browsers support it. So you can actually go and install this as a dedicated application on your system. So you click that, it opens in its own window, you can, you know, Alt+Tab or Command+Tab between them, super easy. And then turn on the offline mode, and you basically have an app that has offline documentation for all the programming technologies that you care about. So this is my new coffee shop buddy.
11:56 Does the search go across the things you've selected, then? So if I search for, yeah, replace or something, it's the things I've selected?
12:03 Yeah. So if you turn on, like, JavaScript and Python, it would look for that in both languages. Okay. Yeah. So basically, the ones you turn on. There's a tonne of them, right? And you pick, you say, these are interesting to me. And then search and stuff, from what I can tell, only applies to the technologies you say you care about. Because, like, if you don't use Java, you really don't need to see the documentation for Java in search, right? That would be useless.
12:24 And one of the things I like about this is it also has versions. So if you're using a, like an older version of Postgres, you can just enable that version,
12:33 Right. Sometimes it doesn't matter very much. But other times it matters massively, like Bootstrap 3 and Bootstrap 5, right? Fully incompatible, basically, like they have totally different keywords and grid systems. And you don't want just the latest if you've got an old app you're working on, something like that. Python's more forgiving about that kind of stuff, right? It doesn't break as often.
12:52 I was amused that the list, though, is like 3.9, 3.8 for Python, and has 3.10 at the bottom, which is obviously because it's
13:03 alphabetically sorted. Oh, interesting. Ian, what do you think of this?
13:08 That's pretty cool. I'm amazed. Is somebody at DevDocs kind of manually maintaining all of the links to these, like the original source documentation?
13:17 Yeah. Where are they getting it from? Right? Because they're super disparate. It's like matplotlib, and Markdown, and MariaDB. It's unlikely they're all stored in the same basic system, right? I don't know how they get them, actually.
13:30 Yeah, that's really cool. I don't know I normally have solved the same problem by having like, 130 tabs open to different bits of Python docs and pandas. And
13:38 exactly, exactly. Yeah, I'm pretty sure they've got pandas in here. They've got NumPy as its own thing, we saw matplotlib, there's pandas, and there's even, you know, versions of pandas across there.
13:49 Single tab solution. Brilliant.
13:52 Yeah, it looks pretty good to me. All right, Ian, want to tell us about what you've got for your first item?
13:58 Okay. Sure. Yeah. So, as I mentioned earlier, I own a package called MSTICPy. And the first thing to sort out with it is the spelling, because I suffer from this on a daily basis, mistyping it even though I've owned it for like three or four years. So it's M-S-T-I-C, which stands for Microsoft Threat Intelligence Centre. There's no Y or anything like that in there. So it's a tool set for cybersecurity investigations and hunting in Python, mainly in Jupyter notebooks. So a couple of questions to ask about that. Firstly, what is cybersecurity hunting and investigation? And why are Jupyter notebooks useful? So the first one: cybersec investigation is really responding to alerts or other kinds of threat intelligence, and trawling through typically large amounts of security logs from cloud services, hosts, account services, to determine whether this is a real threat or not. And there are two main... One
14:55 of the huge problems, right, is you've got all these different systems. If you don't have a tool like this, how are you going to know that someone's in there rooting around, right? Yeah,
15:06 yeah. And there are a couple of things that usually trigger this kind of search. So one of them is an alert, maybe coming from your SIEM, which stands for security information and event management, a console like ArcSight as a traditional one or Microsoft Sentinel as a cloud-based one. So you get an alert based on a rule, and in a fairly managed process, somebody needs to go and investigate: is this a real threat, or is this just noise? Or there might be something like SolarWinds a year ago, or Log4j, where something in the press or something from a threat intel kind of alert says this kind of threat is around, and it's a more ad hoc process, kind of hunting: do we see this in our organisation? So that's what MSTICPy is trying to do, you know, trying to address the needs of that. And the second question is why Jupyter notebooks? Why would you do it in a Jupyter notebook, rather than in your existing SOC tools? I think this kind of activity has a lot in common with big data science in something like astronomy, where, you know, hunting for adversary activity is a little bit like trying to find exoplanets in gigabytes of data, or a new quasar or something like
16:25 100,000 stars or 100,000 lines of log file. Yeah, looking for some patterns and stuff, and
16:31 you've got a few photons, and you're trying to determine, are these kind of different? You know, adversary activity is a little bit like that: millions and millions of events, and you're trying to find the bad stuff. So traditional SOC tools can be really excellent, and I work with one that I think is really good, but they all have limitations. A SOC is a Security Operations Centre. So a SOC tool is something like, you know, a console that fires alerts, and they have a bunch of analysts and engineers looking at the output of this and deciding, and that's the trigger for their investigations. They like
17:08 they're like failed logins to SQL Server, yes, something like
17:11 that, or, you know, it could be a more sophisticated thing, like it looks like something's trying to access password data on this host, or has made a weird kind of configuration change to mailbox settings. So all those kinds of things can trigger alerts and investigations. But you are limited in most Operations Centre environments, and notebooks allow you to break out of some of those constraints. So firstly, you can get data from anywhere. You're not just limited by what's in your logs; you can go to VirusTotal, or you can bring data from anywhere. You can use customised analysis, so write your own or get things from PyPI, and lots of people have written this stuff. You control the workflow, so you don't have to follow what the tool says: you can reorder things, you can backtrack, redo things. And the workflow's repeatable. So if you get a similar kind of alert again, you can fish out an old notebook and rerun the same kind of analysis. And you end up with a nice shareable document that describes your investigation, a bit like the results of a scientific investigation. It's like, here are all the steps I took, and these are the results, and this is what we determined to be the bad activity.
18:33 Right. The other thing that seems useful here is Jupyter. Often the notebooks will save the last bit of computed information. And then you can go change a cell ask the question, again, without rerunning the whole thing. And like, if that's parsing tonnes of logs or pulling them over SSH, or whatever that not doing that, again, is nice.
18:54 Yeah. And it's brilliant. If you're doing those queries in different browser tabs and your browser crashes, they've all gone. What do you do? With a notebook they're saved, like, second by second after you do it. You can just go back, and you can go back to things you may have done months ago.
19:10 Yeah, absolutely. Yeah. So
19:12 when I started all of this, I kind of thought a lot of this stuff for cyber investigations would be available on PyPI, because Jupyter notebooks seemed brilliant, and there's going to be a process tree viewer and there's going to be an event timeline and all this kind of stuff. And I found out there wasn't, at least I couldn't find it. So I decided I needed to start writing this stuff. It turns out that the visualisations you need for detecting exoplanets are a bit different from the ones you need to detect bad actors. So we started building this thing, originally me, but there's now Pete Bryan and Ashwin Patil, two of my colleagues, also working on it, and a bunch of people in the community. It's got four main functional sections. The first is data query: how you get data in, how you do templated queries. The second is enrichment: for example, if you have something like an IP address, you have a bunch of questions about it as an analyst, like which geographical location is this IP address from? Does it have any malware reports about it? The third area is analysis, things like anomaly identification, like that thing you talked about, an unusual spike in failed logon events, that kind of thing. The final area is visualisations, and these are more specialised. I've got a couple of examples in the show notes. This is like an anomaly identification plot. We use Bokeh, which is a really nice visualisation package, to allow you to view data in the way that an analyst expects to see it, so it's this kind of visualisation rather than more traditional bar graphs.
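Editor's sketch: the "unusual spike in failed logon events" idea mentioned above can be illustrated with a very simple z-score check over hourly counts. This is not MSTICPy's anomaly detection; the function name `find_spikes`, the input shape (one hour label per failed logon), and the threshold of 2 standard deviations are assumptions for illustration only.

```python
from collections import Counter
from statistics import mean, pstdev


def find_spikes(event_hours, threshold=2.0):
    """Return the hours whose event count is unusually high.

    event_hours holds one hour label per failed logon event. Hours with
    no events at all do not appear in the input and are ignored here.
    """
    counts = Counter(event_hours)
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # Every hour has the same count, so nothing stands out.
        return []
    return sorted(
        hour for hour, count in counts.items()
        if (count - mu) / sigma > threshold
    )


# Ten quiet hours with 2 failures each, then 40 failures in hour 10.
events = [hour for hour in range(10) for _ in range(2)] + [10] * 40
print(find_spikes(events))
```

A notebook would typically plot these counts on a timeline as well, so the analyst can see the spike in context rather than just a list of hours.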
20:54 I would much rather look at this than log files, or event logs or whatever, you know,
20:58 yeah, that's the whole thing. You know, you might have thousands of events, and you need to get down to the few that are the interesting thing. So one of the areas that we've tried to focus on currently: we wrote all this stuff, and you have hundreds of functions that you could use, but it's kind of difficult to discover them. And because they evolved a little bit organically, they all work in a very different way, with different sets of parameters. So the work that we're currently doing is trying to make this all a bit more accessible. So all of the functions that relate to, say, an IP address, or the questions you want to ask about it, are dynamically attached to a class called IpAddress. So they're all like things.
21:41 Oh, interesting. So you don't have to work with a raw string or just some raw IP representation, but you can ask it questions, like its location.
21:49 Well, it's not quite that intelligent. It's even a bit less intelligent than Alexa. But it's more like, you know, there might be things like geolocation of an IP address, threat intel lookups, different queries that might have IP addresses as a parameter. And previously, you'd have to go and find all of these things and import them separately and run them. But now, the fact that they use an IP address as a parameter means they're all dynamically attached as methods, so you just have one object to import, and then you can do all of these different operations on this single item. There are some things that don't work with that. Some things like the visualisations, for example, are not IP address or host or account specific; they work on big blocks of data. So the other area we're working on is anything that takes a bunch of data as an input: we're writing those as pandas accessors, so they appear as methods on a DataFrame. So you do DataFrame.mp_plot.timeline, right, and it would produce your timeline, as long as it's the right kind of data. So yeah, that's one of the challenges of writing this kind of thing organically: you end up with a lot of stuff, but nobody knows it's there, and everybody has to import it. So we're trying to make it as accessible as possible, so that it just becomes a very intuitive thing. Oh, I have an IP address. What functions can I do? I could do this. It's tab completable. I think
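Editor's sketch: the idea of dynamically attaching stand-alone functions to an entity class, so they become discoverable methods, can be shown in a few lines of plain Python. This is only an illustration of the general technique, not MSTICPy's actual mechanism; the `IpAddress` class here is a minimal stand-in, and `is_private`, `ip_version`, and `attach_functions` are hypothetical names.

```python
import ipaddress


class IpAddress:
    """Minimal stand-in for an entity class; not MSTICPy's real one."""

    def __init__(self, address):
        self.address = address


# Stand-alone functions that each take an IP entity as their first parameter.
def is_private(ip):
    return ipaddress.ip_address(ip.address).is_private


def ip_version(ip):
    return ipaddress.ip_address(ip.address).version


def attach_functions(entity_cls, functions):
    """Attach each function to the class so it can be called as a method."""
    for func in functions:
        setattr(entity_cls, func.__name__, func)


attach_functions(IpAddress, [is_private, ip_version])

ip = IpAddress("10.0.0.5")
# The attached functions now show up under tab completion on the object.
print(ip.is_private(), ip.ip_version())
```

Because a function stored on a class is bound like any other method, the entity instance is passed as the first argument automatically, which is why one import is enough to reach every operation.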
23:14 yeah, I think it's really cool. Taking this Python data stack view of cyber security and threat detection. Yeah. Yeah. Brian, what do you think?
23:23 Well, it's definitely a complicated area. One of the things I like about this story is just talking about the complexities in API design and discoverability. That applies to lots of different fields. But
23:39 yeah, it's one of those things you should have thought about at the beginning. But even at the end, you can tidy things up.
23:46 Yeah, so, famous last words.
23:50 So yeah, we're definitely open for like, other people collaborating contributing stuff. Because there's a lot of ground to cover.
23:57 Yeah, for sure. It's on GitHub. I saw. Yep. One final question before we move on. Is it just for Azure? Or is this a thing that more broadly works across different systems?
24:08 No, I should have mentioned that a little bit earlier on. We originally built it for Microsoft Sentinel notebooks, but it supports, like, Splunk, Defender, and we're working on an Elastic provider. So really, for anything you can get into a pandas DataFrame, you can use most of the functionality. So even if we don't have a provider ourselves, if you've got something like PySpark and you can get a DataFrame, then all of our functions take a DataFrame. You know, we use pandas as our universal data interchange format.
24:37 Yeah, indeed, indeed. Kim van Wyk out in the audience likes it: it's a much nicer way to glean info from logs than complex grep. I'm right there with you. Now, before we move on, Brian, let me tell you about our sponsor for this episode. This episode of Python Bytes is brought to you by FusionAuth. FusionAuth is an authentication and authorization platform built by devs, for devs. It solves the problem of building essential user security without adding risk or distracting from the primary application. FusionAuth has all the features you need, with great support and a price that won't break the bank. And you can either self-host it or get the fully managed solution hosted in any AWS region. Do you have a side project that needs custom login and registration, multi-factor authentication, social logins, or user management? Download FusionAuth Community Edition for free. The best part is you get unlimited users, and there's no credit card or subscription required. Learn more and get started at pythonbytes.fm/fusionauth. Links in your show notes. Thank you to FusionAuth for supporting the show. All right, what do you got for your next one, Brian? Numbers, something every computer scientist should know. Yes, floating
25:49 point arithmetic is complicated. When I started working professionally, one of the things I was recommended to read was an article called What Every Computer Scientist Should Know About Floating-Point Arithmetic. And don't worry, it's only like a really long paper with lots of math. So I am not telling you to read this, although it is an interesting read. What I would like you to read is this article by David Amos called The Right Way to Compare Floats in Python, because there's a few things that we need to know about floats when we're using them, and he covers all of this in the article without going through tonnes of scary math. Floating point numbers have to be represented in a way that the computer can store them and use them and manipulate them, even though some numbers are huge and won't fit normally. So we have to do things like accept that there's an error in rounding. So there's a little bit of a discussion there. One of the things that surprises people sometimes when they first come into Python, but it's not just Python, it's most languages, is somewhere there's going to be something obvious that doesn't work. Like in David's example, 0.1 + 0.2 == 0.3, and that will show up as False, because they're not equal. And this is weird. They obviously
27:17 are Oh, crazy that that doesn't work. But but it's not
27:21 just equals. You can also do comparisons, like, you know, less than or greater than. So not only are they not equal, 0.1 + 0.2 is not even less than or equal to 0.3. It's weird. So what do you do? The gist of it is, don't compare things with normal math comparisons if there's floating points involved. What you want to do instead... here's a little tiny bit of math, way less than the example,
27:53 that thesis dissertation. Yeah. So there's a whole bunch of stuff
27:57 built into Python that you can start to work with for comparisons. And one of the most common ones is math.isclose. So there's a math library that has an isclose function, and it's used to just say, hey, I've got two values, are these close enough? If you have to compare floats, something like this is great. Behind the scenes, what it does is it takes the two values, subtracts them, and figures out if the absolute value of the delta is below some reasonable tolerance, like close enough. And that tolerance is either a relative or an absolute tolerance. Most of the time, you can kind of get away with not caring about that. But if you do care about it, you can control that: you can pass in what tolerance you expect things to be within. I use stuff like this all the time with test equipment, because I definitely want control over the tolerance levels. So yeah, for sure. So there's math.isclose. But then, I'm not going to scroll all the way down here, but he also covers NumPy. NumPy has got a couple of these that are really great. One of them is isclose also, but it works on arrays, and it'll give you an array of true and false values. But you can also use allclose, which just says you've got two arrays, and it's true if all of the pairs are close enough. Also covered, which we use during testing a lot, is pytest.approx, which is a little bit of a different beast. But David covers that. So basically, this is a semi-regular reminder to anybody using floating point math in Python, or any other language, that you should be careful with it.
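Editor's sketch: the comparisons discussed above can be tried directly with the standard library. `math.isclose` and its `rel_tol` and `abs_tol` parameters are real; the `all_close` helper is a hypothetical pure-Python stand-in for the idea behind NumPy's `allclose`, written here only so the example needs no third-party packages.

```python
import math

total = 0.1 + 0.2

# Plain comparisons trip over the rounding error in the binary representation.
print(total == 0.3)               # False
print(total <= 0.3)               # False

# isclose compares within a tolerance (default rel_tol=1e-09, abs_tol=0.0).
print(math.isclose(total, 0.3))   # True

# A looser relative tolerance, for example for readings from test equipment.
print(math.isclose(1000.0, 1000.1, rel_tol=1e-3))   # True

# Near zero a relative tolerance is not enough, so pass an absolute one.
print(math.isclose(1e-10, 0.0))                     # False
print(math.isclose(1e-10, 0.0, abs_tol=1e-9))       # True


def all_close(xs, ys, **tolerances):
    """True if the sequences have equal length and every pair is close."""
    return len(xs) == len(ys) and all(
        math.isclose(x, y, **tolerances) for x, y in zip(xs, ys)
    )


print(all_close([0.1 + 0.2, 0.5], [0.3, 0.5]))      # True
```

The zero case is the one that most often surprises people: with only a relative tolerance, nothing except exactly zero is ever "close" to zero.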
29:51 So yeah, it's not a Python thing. It's just a fit, representing Yeah, and that don't fit.
29:56 Now, there are sometimes where you have to be very exact, you need to be very precise. And in those cases, Python does have the Decimal and Fraction types. David covers these in the article, and they're cool things to know about, like, definitely for people working with money or other very high precision needs. Those are covered, though you do take some sort of a hit for using them. But if you really care about the precision and want to do things exactly right, then you probably should read that larger article, because there's things that you have to do, like certain operations before other operations, to try to keep the error from accumulating too high. So it gets messy.
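Editor's sketch: the exact types mentioned above live in the standard library's `decimal` and `fractions` modules. The values below are illustrative only.

```python
from decimal import Decimal
from fractions import Fraction

# Build Decimals from strings so the value is exactly what was typed.
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))       # True

# Fractions store an exact numerator and denominator.
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))    # True

# Building a Decimal from a float copies the float's rounding error,
# so it no longer equals the decimal literal.
print(Decimal(0.1) == Decimal("0.1"))                          # False

# A money-style calculation stays exact.
print(Decimal("19.99") * 3 == Decimal("59.97"))                # True
```

The trade-off mentioned in the discussion still applies: these types are exact for the values they can represent, but arithmetic with them is generally slower than with built-in floats.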
30:40 I'm, I think, fundamentally disturbed by the idea that zero isn't zero. So my approach to floating point numbers is always to convert them to ints.
30:50 Yeah, I was thinking that. Yes, sometimes that is the way to do it. Right. I was thinking this kind of stuff maybe applies a lot to the project that you're working on. If you're trying to come up with ratios that represent, you know, how risky something is and things like that? Yeah,
31:07 Yeah. Yeah, certainly. Yeah, I was being a bit flippant before. It's just that I'm very Platonic at heart, I think. So, like, one should be one, not nearly one or nearly zero. There should
31:20 be a perfect square and a perfect circle. Like, how can they not exist in our language?
31:25 Is it really zero, or negative zero?
31:32 Henry in the audience also points out that pytest.approx works on NumPy arrays as well, just to put that all together. Alright. Let me tell you all about pypyr. I think that might be the way you pronounce it. Everything needs its own description, its own little phonetic bit. So this is a simple way to create scripts that run and do stuff on your computer using Python. And what's cool about it is it has a real simple way to define the steps. Some of those steps can be optional. But then you can also piece together things like other programs, so you can combine commands, different scripts in different languages, and applications all into one sequence of events that happens on your computer. So it's basically a task runner where you define stuff in YAML. And probably the best way to see it is to go check out the docs. There's a whole bunch of docs, and the docs are really nice here, actually. So for example, if you go to Getting Started and come down here to "run your first pipeline", I really like the way the docs look. But the way you define it, here's like step one: you just say the steps, and it's all YAML. You can give a step a name, so you can refer to it. And then you have inputs and outputs. And you do little curly-brace string interpolation types of things. Or you can have more complex ones, with different steps. You can even have little comments; there's a way to put a comment in your YAML file as well. There's also conditionals. Let's see if I can find a good conditional one down here. Here's one, and this one is just an echo statement and the ping command, but you know, whatever you want to do. You can basically pass command line arguments to the YAML file, or to the workflow, the pipeline, and it'll take those and feed them into the steps.
So for example, when you call it, you can say like count equals one and IP equals that, and those become the little string-interpolated pieces that go in there. So you can just combine basically whatever commands are available to the shell, right? Be that Python, or POSIX, or Windows, or PowerShell, or whatever you're looking to do. Pretty cool, huh?
33:43 That's pretty cool. I might need this for my job of automating my show notes. I might use this.
33:50 Oh, yeah, there you go. If you can find a use for this, go do that. And so, here's one that sort of uses truthiness. There's a bunch of different steps, and you can use the run flag. So here it says run if there's a value for A on this one, and this one says run if there's a value for B. And then there's an example where it says, okay, if we run it by itself, those don't run. But if you pass A, then it runs the A step; pass B, it does the B step; or it can do both if you pass them both. And I like the simplicity of it. A lot of tools like this feel like they're pretty complicated. You know, it's sort of like your example with gensim, Brian, like, is this thing too heavyweight for what I'm trying to ask it to do? And this seems like a real simple thing. And I don't have to learn about make or any of those kinds of things. Yeah.
34:35 GitHub Actions or... Yeah, yeah. It's
34:39 kind of got a bit of a GitHub Actions feel to it.
34:41 It just seems like a nice kind of declarative approach. That's pretty cool.
34:45 Indeed. Yeah. If you're not into programming, or you don't want your steps to be programming. But of course, at each step you could call a Python app or script that's going to do something complicated, right, if it needs to. But the orchestration of that, you don't have to make complicated.
35:00 Is it just a command line tool? Or can you invoke it from Python? That might be a
35:04 bit interesting. I'm sure there's a way to import it and make it do a thing. You know, it's probably just a Python package with an entry point in its package. So I would think so, yeah.
35:15 So it'd be nice to be able to do that, rather than just using subprocess to invoke a lot of things. Like if you're...
35:20 Interesting. I hadn't really thought about it as a replacement for subprocess. But yeah, because a lot of times when you're trying to orchestrate stuff, like it talks about here, being part of the shell or being another app or another language, you would just use subprocess on it, right? Yeah, cool. Well, there it is: pypyr, pypyr.io. People can check that out. Looks pretty interesting. All right, Ian, want to take us out with your final item here?
35:44 Ah, Pygments. Okay. So this is a package. I mean, if you're a developer, there's a very good chance that you have been using this for years, like me, without knowing about it. You might have seen it being installed as a dependency and thought, what is that thing?
35:58 That was my thought. I'm like, I see this all the time in my dependencies, and I just never really bothered to look into what it does.
36:04 Yeah. So I hadn't until recently. So if you use Jupyter notebook markdown, you know you can put three backticks and then a block of code. And you can actually put like python or bash or something, and it will intelligently highlight it. So the thing that's doing that intelligent highlighting is Pygments. GitHub markdown, same kind of thing, though I'm not sure whether GitHub uses Pygments. And if you do developer docs, like Read the Docs and Sphinx, that also uses Pygments to kind of colour code your code samples. And I know a lot of people are writing blog posts and stuff like that. There are quite a few services out there where you can take a chunk of code, and it will intelligently highlight it and give you a JPEG or a PNG back. And that's kind of nice, but then you can't copy and paste the code from those samples. I don't like that, really. I think if you're going to put code in an article, you probably intend for people to be able to copy and paste it.
37:05 Yeah, that's the most likely thing you are to copy and paste, right? Because they want that code over here. You
37:10 don't want to... I mean, you could use OCR to, like, reinterpret it. So yeah, and then maybe Brian's gensim to, like, tidy it up. But with Pygments, you can use it as a standalone package, and it can do this kind of rendering. It can render to HTML with CSS style sheets for all of the colouring, and also render to ANSI terminal, LaTeX, and a few other kinds of things. So if you want a nicely formatted piece of code in a document, or you're doing developer docs, it's certainly kind of useful. And I came across it... I should say one thing: it also supports, maybe I can just switch, lots and lots of languages. So it's very simple to use. It has a highlight function. And then you import a lexer, which is the thing that understands the tokens in a language, and the formatter for the output type you want. And I think there's hundreds of these things. So
38:10 there are a lot of languages in there. No kidding.
38:13 More than half of these I've never heard of. And as well as the things you'd expect, like Python, it also supports Python tracebacks, so it has a separate lexer for colour coding tracebacks. All the usual languages you'd expect, but also some things like data formats: TOML, JSON, XML.
38:33 Okay, interesting. So for a lot of the files that we might run across, you can syntax highlight them. Yeah.
38:39 And so it's very easy to use. And the reason I came across it recently is because a lot of attacker code tends to be deliberately obfuscated. So it's kind of like base64 encoded, but then even once you decode it, it's kind of mangled in a way to make it as unreadable as possible. So one of the things that we try to do is pull that code back, like decode it, try to clean it, deobfuscate it. And if you can present it as close to the way a developer would write it as possible, it makes it much quicker for an analyst to determine what it is doing. So we use it now in MSTICPy to kind of colour display things like malicious PowerShell script or bash or something like that. So that's how I came across it: rather than just seeing it go past as part of a pip install, I actually had to invoke it directly. So a big shout out to the developers and maintainers of Pygments here. It's one of those packages that probably millions of people benefit from, but very few people kind of know about it, and it's just super easy to use. They seem to be adding lexers all the time. So, brilliant.
39:51 Yeah, this is amazing. I didn't realise that it did all of this. This is way more advanced than I thought. Brian, did you know?
39:58 No, I just thought it was something that magically did syntax highlighting. I didn't have to care about it.
40:05 Yeah, exactly. Good. There's a little example in the show notes as well that I pasted,
40:12 it has a dark theme. Yeah. Yeah. Yeah. And you,
40:16 you probably want to include this nobackground=True for using Jupyter Notebooks. Because if you select a theme, it just flips the whole notebook's kind of CSS theme. So that tells it just not to mess with what's in the background.
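The highlight function, lexer, and formatter described above can be sketched roughly as follows. This is a minimal example assuming the pygments package is installed; the sample snippets are made up, and it is not the code from the show notes or from MSTICPy.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer, get_lexer_by_name

# Made-up sample code to highlight.
python_code = 'print("hello")\n'

# nobackground=True asks the HTML formatter not to emit a background colour
# for the wrapping element, so the host page's own theme is left alone.
formatter = HtmlFormatter(nobackground=True)

# highlight() takes the source text, a lexer for the language, and a
# formatter for the output type, and returns the rendered string.
python_html = highlight(python_code, PythonLexer(), formatter)

# The CSS rules matching the generated markup can be produced separately.
css_rules = formatter.get_style_defs(".highlight")

# Lexers can also be looked up by a short name such as "bash".
bash_lexer = get_lexer_by_name("bash")
bash_html = highlight("echo hello\n", bash_lexer, formatter)
```

In a notebook, the returned HTML string would typically be handed to whatever display mechanism the notebook provides. Because the output is still text, readers can copy the code, unlike an image of highlighted code.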
40:28 Okay. Yeah, that looks great. Thanks for pointing out how useful that can be. That's cool. Like I said, I've seen it go by all the time. I just never really, yeah, paid that much attention to it.
40:39 It's probably a pretty minority use. But look, if you need it, it's great. Yeah,
40:43 it's incredibly powerful. Fantastic. Well, that's all of our main items. Brian, you got any extras?
40:47 Just one extra, actually. When I was doing that first topic, gensim, it doesn't have very many dependencies, but one of the dependencies is this library called smart_open. And I'm like, well, I open things, and I want to be smart about it. So I checked this out, and it's pretty neat. I don't know if we've covered this before. But it basically mimics the interface of the normal Python open, but you can pass it really anything. It does, like, transparent on-the-fly reading of things, efficient streaming of large files from like S3 or Azure, or over the web
41:31 straight. Just HTTP. Yeah. If you just have a link to a large file on a web server,
41:35 yeah. And then the code for it is just super nice. You know, you import open from smart_open, and you've got, like, for line in open this thing. And you can just work with each line there. It's pretty cool.
41:51 I love it. That's a great one. Nice. Ian, got any extras you want to shout out while we're here?
41:58 I don't, I'm afraid.
42:02 I have two real quick ones to just quickly talk about. Last time, Emily Morehouse spoke about using autosquash, which was really cool. So Adam, let me get the attribution correct here, Adam Parkin sent in a follow-up to say, hey, you should check out this article over here called "Fixing commits with git commit --fixup and git rebase --autosquash". The long and the short of it is it talks about doing a lot of the things that Emily said, which is pretty cool. But in the end, it's setting up your git config with autosquash equals true and then adding an alias, so you can just type git space fixup. And when you type that, it actually does git log and shows the last 50 items, and then allows you to go back and work with those. Basically, it's just a real quick way to get back into the scenario where you mark different commits for fixup. So people can check that out if they were following Emily's advice but want it to be one line they don't have to remember. There you go. That's cool. And then Python 3.10.3 is out as of about a week ago, I suppose. So there are many changes in here. You know, there are so many great changes here. I don't know how many, do you think that's probably 100, maybe a little bit less? It would be great if there was like a "these are critically important" section at the front, like there's a security problem that was fixed, or there's a thing we've taken out that is no longer here. They're kind of all the same priority. But nonetheless, there's a bunch of changes that people can check out, and you can upgrade to the newer version of Python 3.10.
43:42 Different people care about different stuff, though.
43:45 I know. I don't want to impose my importance on other people's importance.
43:49 So it's funny, right? When I first came across Python, I was like, why is it so slow between major versions coming out? But then suddenly, as a Python developer, it's like, why are the versions coming out so quickly?
44:02 Yeah, it's definitely true. There's a tonne of change. This is just, you know, some minor version that has all these changes in here, which is pretty cool.
44:11 Well, we also used to be on an 18 month cycle, and now we're on a yearly cycle. So just Yeah,
44:16 Yeah. It's Łukasz Langa's fault that we are 50% faster now. Thanks, Łukasz. Alright. How about a joke to close out the show? That'd be great. Yeah. So here's a good tweet. And it's this sort of perplexed, I think in a good way, character wearing all these... are these prizes? I don't know. Anyway: Python developers, when someone asks what their secret is, this person just says, "I just keep writing pseudocode and it just keeps working." So it's a little bit like that joke where they have some pseudocode in a text file, and they're like, just rename it to .py and try to run it. Anyway, that's the joke. Nice. Thank you, Brian, as always, and Ian, thanks for being part of the show. Thank you. Great to have you here.
45:00 Thank you very much, both. It's been a real pleasure.
45:02 Yeah, it sure has.