#276: Tracking cyber intruders with Jupyter and Python

Published Wed, Mar 23, 2022, recorded Tue, Mar 22, 2022

About the show

Sponsored by FusionAuth: pythonbytes.fm/fusionauth

Special guest: Ian Hellen

Brian #1: gensim.parsing.preprocessing

Problem I’m working on
- Turn a blog title into a possible url
  - example: “Twisted and Testing Event Driven / Asynchronous Applications - Glyph”
  - would like, perhaps: “twisted-testing-event-driven-asynchronous-applications”
Sub-problem: remove stop words ← this is the hard part
I started with an article called Removing Stop Words from Strings in Python
- It covered how to do this with NLTK, Gensim, and SpaCy
- I was most successful with remove_stopwords() from Gensim
  - from gensim.parsing.preprocessing import remove_stopwords
  - It’s part of the gensim.parsing.preprocessing module
I wonder what’s all in there?
- a treasure trove
- gensim.parsing.preprocessing.preprocess_string is one
- this function applies filters to a string, with the defaults almost being just what I want:
  - strip_tags()
  - strip_punctuation()
  - strip_multiple_whitespaces()
  - strip_numeric()
  - remove_stopwords()
  - strip_short()
  - stem_text() ← I think I want everything except this
    - this one turns “Twisted” into “Twist”, not good.
There’s lots of other text processing goodies in there also.
Oh, yeah, and Gensim is also cool.
- topic modeling for training semantic NLP models
So, I think I found a really big hammer for my little problem.
- But I’m good with that

Michael #2: DevDocs

via Loic Thomson
Gather and search a bunch of technology docs together at once
For example: Python + Flask + JavaScript + Vue + CSS
Has an offline mode for laptops / tablets
Installs as a PWA (sadly not on Firefox)

Ian #3: MSTICPy

MSTICPy is a toolset for CyberSecurity investigations and hunting in Jupyter notebooks.
What is CyberSec hunting/investigating? - responding to security alerts and threat intelligence reports, trawling through security logs from cloud services and hosts to determine if it’s a real threat or not.
Why Jupyter notebooks?
- SOC (Security Ops Center) tools can be excellent but all have limitations
- You can get data from anywhere
- Use custom analysis and visualizations
- Control the workflow…. workflow is repeatable
Open source pkg - created originally to support MS Sentinel Notebooks but now supports lots of providers. When I started this 3+ yrs ago I thought a lot of this would be on PyPI - but no 😞
MSTICPy has 4 main functional areas:
- Data querying - import log data (Sentinel, Splunk, MS Defender, others…working on Elasticsearch)
- Enrichment - is this IP Address or domain known to be malicious?
- Analysis - extract more info from data, identify anomalies (simple example - spike in logon failures)
- Visualization - more specialized than traditional graphs - timelines, process trees.
All components use pandas, Bokeh for visualizations
Current focus on usability, discovery of functionality and being able to chain
Always looking for collaborators and contributors - code, docs, queries, critiques
https://github.com/microsoft/msticpy
https://msticpy.readthedocs.io/
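
To make the "spike in logon failures" example concrete, here is a rough standard-library sketch. It is not MSTICPy's API or its anomaly algorithm: the data is invented, find_spikes is a hypothetical helper, and "more than 5x the median hourly count" is an arbitrary threshold chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import median

# Invented sample data: number of failed logons in each hour of one night.
start = datetime(2022, 3, 22, 0, 0)
failures_by_hour = {0: 2, 1: 3, 2: 2, 3: 25, 4: 1, 5: 2}
failed_logons = [
    start + timedelta(hours=hour, minutes=minute)
    for hour, count in failures_by_hour.items()
    for minute in range(count)
]


def find_spikes(timestamps, factor=5):
    """Hypothetical helper: return the hours whose failure count
    exceeds `factor` times the median hourly count."""
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    baseline = median(per_hour.values())
    return {hour: n for hour, n in per_hour.items() if n > factor * baseline}


spikes = find_spikes(failed_logons)
for hour, n in spikes.items():
    print(f"{hour:%Y-%m-%d %H:%M} had {n} failed logons")
```

Only the 03:00 hour is flagged with this sample data; a real investigation would feed in events from the log queries instead of a hand-made list.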

Time series analysis for identifying anomalies

Process tree visualizer

Threat intelligence browser

Brian #4: The Right Way To Compare Floats in Python

David Amos
Definitely an easier read than the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic
- What many of us remember
  - floating point numbers aren’t exact due to representation limitations and rounding error,
  - errors can accumulate
  - comparison is tricky
Be careful when comparing floating point numbers, even simple comparisons, like:

>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2 <= 0.3
False
David has a short but nice introduction to the problems of representation and rounding.
Three reasons for rounding
- more significant digits than floating point allows
- irrational numbers
- rational but non-terminating
So how do you compare:
- math.isclose()
  - be aware of rel_tol and abs_tol and when to use each.
- numpy.allclose(), returns a boolean comparing two arrays
- numpy.isclose(), returns an array of booleans
- pytest.approx(), used a bit differently
  - 0.1 + 0.2 == pytest.approx(0.3)
  - Also allows rel and abs comparisons
Discussion of Decimal and Fraction types
- And the memory and speed hit you take on when using them.
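
A small standard-library sketch of the comparison options above. The variable names are just for illustration; the tolerance values are the math.isclose() defaults unless stated.

```python
import math
from decimal import Decimal
from fractions import Fraction

total = 0.1 + 0.2

# Exact comparison fails because of representation error.
naive_equal = total == 0.3

# rel_tol (default 1e-09) scales with the size of the inputs.
close_by_default = math.isclose(total, 0.3)

# Near zero a relative tolerance is effectively zero, so abs_tol is needed.
near_zero_rel_only = math.isclose(1e-12, 0.0)
near_zero_with_abs = math.isclose(1e-12, 0.0, abs_tol=1e-9)

# Decimal and Fraction represent these values exactly, at a speed and memory cost.
decimal_equal = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
fraction_equal = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)

print(naive_equal, close_by_default)
print(near_zero_rel_only, near_zero_with_abs)
print(decimal_equal, fraction_equal)
```

Note that the Decimal values are built from strings; Decimal(0.1) would inherit the float's representation error.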

Michael #5: Pypyr

Task runner for automation pipelines
For when your shell scripts get out of hand. Less tricky than makefile.
Script sequential task workflow steps in yaml
Conditional execution, loops, error handling & retries
Have a look at the getting started.

Ian #6: Pygments

Python package that’s useful for anyone who wants to display code
- Jupyter notebook Markdown and GitHub markdown let you display code with syntax highlighting. (Jupyter uses Pygments behind the scenes to do this.)
- There are tools that convert code to image format (PNG, JPG, etc) but you lose the ability to copy/paste the code
Pygments can intelligently render syntax-highlighted code to HTML (and other formats)
Applications:
- Documentation (used by Sphinx/ReadtheDocs) - render code to HTML + CSS
- Displaying code snippets dynamically in readable form
Lots (maybe 100s) of code lexers - Python (code, traceback), Bash, C, JS, CSS, HTML, also config and data formats like TOML, JSON, XML

Easy to use - 3 lines of code - example:

from IPython.display import display, HTML
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

code = """
def print_hello(who="World"):
    message = f"Hello {who}"
    print(message)
"""
display(HTML(
    highlight(code, PythonLexer(), HtmlFormatter(full=True, nobackground=True))
))
# use HtmlFormatter(style="stata-dark", full=True, nobackground=True)
# for dark themes

Output to HTML, Latex, image formats.
We use it in MSTICPy for displaying scripts used in attacks.

Extras

Brian:

smart-open
- one of the 3 Gensim dependencies
- It’s for streaming large files, from really anywhere, and looks just like Python’s open().

Michael:

Python 3.10.3 is out.
git fixup (follow up from last week, via Adam Parkin)

Joke: What’s your secret?

Episode Transcript

00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.

00:04 This is episode 276, recorded March 22nd, 2022. So many twos. I'm Michael Kennedy.

00:13 And I'm Brian Okken.

00:14 And I'm Ian Hellen.

00:15 Hey, Ian. Welcome to the show. It's great to have you here.

00:18 Thank you very much. I've listened to the show a lot and feel very privileged to appear on it.

00:23 It's our privilege to have you here. Thank you so much for listening. And I know you got some

00:29 cool stuff to share. So we're looking forward to hearing about that. Also, I do want to say

00:33 thank you to FusionAuth for sponsoring the show. I'll tell you more about them later.

00:39 Before we get into the topics, Ian, tell people a quick bit about yourself.

00:42 Sure. I'm a developer in Microsoft, the Microsoft Threat Intelligence Center.

00:48 Been with Microsoft for quite a long time. Only relatively recently, like four years or so ago,

00:52 got into Python coding with Jupyter Notebooks. So I work on Jupyter Notebooks for the Microsoft

00:58 Sentinel project and own a modest open source package called MSTICPy, which we'll

01:05 cover a little bit later. Takes most of my time.

01:07 Fantastic. The whole cybersecurity threat detection stuff, it's very interesting. There's a lot of

01:14 innovation there, but it's also, it's a challenging area to be working.

01:17 Yep. Yep. We're never short of stuff to do.

01:20 Certainly. I'm sure you're not. Well, Brian, how about you kick us off here?

01:25 Well, so I'm going to start off with a problem. So I had a problem and I have a cool solution for it.

01:30 So my problem is on Test & Code, I've got titles and I want to end a show on it. It's an MP3 file,

01:39 but I want to create a show notes, automated show notes or not show notes, a transcript.

01:45 So one of the problems, there's a lot of problems in doing this, trying to automate it, but one of them

01:50 is the title. I want to turn that into something that's a little bit, so something like, you know,

01:57 it's got normal English and capitalization and all sorts of spaces and stuff. I want to turn that into

02:03 things that URLs hate. Yeah. I want to turn that into a URL. And, and one of the problem,

02:09 one of the things is getting rid of stop words. So there's a bunch of stuff like lower casing.

02:13 I can do that easy, but getting rid of stop words was a little hard. So I ran across this,

02:19 this thing called the Gensim parsing preprocessing module. So, preprocessing. Gensim is a larger sort

02:28 of beast. It's used for machine learning and stuff, to generate models. But I am,

02:36 I'm just really using one little piece of it, the preprocessing part. And it's really pretty

02:41 cool. I actually found this article first. There was an article called Removing Stop

02:47 Words from Strings in Python. And it has a discussion of NLTK and Gensim and spaCy. I tried all

02:55 of them out, actually. And the one that really stuck best for me was using

03:02 remove_stopwords, which is exactly what I wanted, right from Gensim. So I went ahead and tried that

03:09 and it worked really well, but I'm like, wait, I'm pulling this in from the preprocessing library.

03:15 I wonder what else is in there. And there's all sorts of really cool stuff in here. There's a

03:22 lowercase to Unicode. It turns it both into lowercase and into Unicode. That's pretty neat.

03:27 Don't think I need it, but that's neat. But then there was one that was, I thought

03:33 maybe this is exactly what I want, something called preprocess_string. And it has a whole bunch

03:38 of filters built into it. Oh, nice. Like strip strip. Yeah. Strip white space, strip punctuation. I love it.

03:45 Yeah. And take away multiple, after it strips punctuation, like you're going to have,

03:49 if I go back, I had a slash in my title for one of the episodes. If it takes that out, I'm going to

03:56 have a space before and a space after. So I want to remove those. So it'll strip multiple white space

04:01 strips out numerics. Cause I probably don't want numbers in there. and then remove stop words.

04:07 The one thing I don't want, so I'll have to customize how I'm calling this, is stem_text.

04:13 So stem_text, I didn't know what that did without playing with it, but what it does is it

04:18 would take things like "twisted" and turn it into "twist". That's really not right.

04:23 So you definitely don't want that. I don't want that. I don't mess it up, but I think I want

04:26 everything else. So this Gensim library, you know, if you're doing machine learning,

04:32 coming up with models, I think this is a great tool to look into. But actually,

04:39 I'm going to use it just for removing stop words to create these titles for, you know, my podcast,

04:45 but the, I think it, it feels a little weird. It feels like I'm using this really big hammer to do

04:51 this little tiny problem. I guess I'm okay with it, but you know, do you have any other ideas

04:57 where it could use or, well, I didn't know about this. So I wrote my own. Okay. And it's, it's,

05:04 it's kind of janky. Like it's a little bit, a little bit recursive iterative. It's like,

05:08 we'll take away all the punctuation. Now turn all of your white spaces into single white spaces.

05:14 Cause there might've been, you know, dot space. So now you've got two white spaces, but you've got to

05:18 take away, you know, there's like a bunch of weird steps and then, then put it back. This looks

05:22 cleaner. It is a dependency, but it does look cleaner. I like this. I think it's, I'm glad I

05:27 know about it. Ian, what do you think? Is it a huge thing? I mean, dependency, but, I always

05:32 think of like ML like stuff, but this is like just the pre-processing, right? Well, I'm actually pulling

05:37 in all of GenSim to get this. I don't know if I can pull in little bits, but, it's, it's not

05:42 really part of my application that I'm shipping. It's just a tool that I'm using on my laptop. So I,

05:49 I guess downloading it once doesn't really bother me too much, even if it's a big thing.

05:53 But cool. Yeah. I was thinking, yeah, that's a good, that's a good point. If it's running

05:56 local, it's like a dev dependency, who cares? Right. It's like worrying about how big pytest

06:00 is. Like it doesn't really matter. And I'm not, well, I kind of do care about that. Cause

06:05 CI is going to pull it in all the time for pytest, but. Yeah, but they got fast networks.

06:10 It's not your bandwidth. It'll be all right. One of the things that struck me about this

06:16 that made me think of your situation is like that lowercase to Unicode in so many

06:21 times in the security space. It's about like, you're checking for this representation, but what

06:27 if there's another representation that means the same thing? Like you don't say go to this

06:31 directory. You say go dot, dot. And then over there, you know, those, those kinds of non-canonical

06:35 representations. I wonder if there's any use of this kind of stuff for you.

06:39 Yeah. There's something I kind of touch on in the Pygments section later on, which is like, the

06:43 attackers typically write scripted attacks and try to obfuscate code using a mix of kind of

06:49 uppercase and putting random dots. I'm just thinking that'd be a nice, potentially a nice

06:53 way of kind of cleaning some of that, that stuff up.

06:56 Yeah, for sure. There was a, there's been some interesting supply chain vulnerability stuff.

07:01 Remember the guy with the colors and, I think, the faker stuff in JavaScript that

07:06 sabotaged his, his libraries. There was another one that maybe well-intentioned. I don't know. It,

07:14 it was some open source library. I don't believe it was Python. I can't remember what it was.

07:19 It could have been, but I'm pretty sure it was in JavaScript because that's where all,

07:22 most of the bad stuff was, it seems. Anyway, they wrote their, they, they taught their dependency

07:28 to erase everybody's hard drive who installed it, who was in Belarus and Russia, which, okay,

07:36 maybe they're trying to contribute, but like it ended up doing a bunch of bad things, even to places

07:40 that were like trying to help say people in the press and journalists do certain things and then like,

07:46 you know, connect with sources, and it erased, like, that database as well. And what they did to make it so

07:51 that nobody would notice in the GitHub commit before it went out to npm was base64 encode their

07:58 changes. So they basically put in a base64-encoded string and then, like, decode and then run that.

08:03 And, you know, it's like that kind of stuff. I know this won't solve that problem, but yeah,

08:07 you know, that, that sort of category of like weird representations.

08:10 Yeah. You need MSTICPy for something like that. It's one of the things we, yeah, it's a common thing,

08:15 kind of base64 decoding before deobfuscating. But yeah.

08:19 Yeah. Interesting. yeah, I thought of maybe using something like that with,

08:24 because one of the problems we have is like every, every script is kind of slightly different.

08:28 if you could use something like that, essentially kind of apply like sentiment analysis to

08:34 script. I mean, this is a big problem. It's just not something I've particularly solved.

08:38 but that might be a kind of useful, useful thing to just picking out certain things that indicates

08:43 malicious, like format, you know, format drive.

08:47 Exactly. Yeah. You could certainly represent like this one does hard drive stuff. Is this,

08:52 I thought it was parsing colors. Why is it doing things with the hard drive? This is odd,

08:55 you know, like, or with the network, stuff like that. Cool. All right. Well, you know what you

09:00 would really want to check out if you were trying to research these things, probably documentation.

09:04 So I want to tell you all about DevDocs, devdocs.io. This is pretty cool. Now, when you get there,

09:09 it's interesting, on my Firefox it's just got like the mobile view, which is really odd. If you go

09:15 there with a full browser, it's what it believes is the full browser. I guess it's like a slightly

09:20 different view. That's pretty similar, but not the same. So there's, if you open up a whole bunch of

09:26 programming technologies, let's say not just Python or JavaScript or something, but there's also Vue.js.

09:33 There's Werkzeug, for example, like some of the foundation of Flask, and you can pick the particular

09:37 versions and stuff. So you can go in and like enable these different things. So maybe I care about

09:42 Vue. I can go over here and enable that one. We definitely want some Python. Let me go find

09:47 some Python and it gives you all the versions. I'll take that. And let's say I'm also working with

09:52 Postgres. So I'll enable that documentation. And then I might be working with NGINX for the front

09:57 end, which is somewhere right here. So you can go enable that. And then it will be up near the top

10:05 somewhere here. You can see these are either the default ones or the ones that I checked on. So then

10:09 you can open them up and say, I want to go and see the NGINX guide about a debugging log. And then it

10:15 takes you to the documentation for that technology. So it's like a meta documentation repository for all

10:22 of these things all at once, which is pretty cool. Right? So I can go up here and search. I want to

10:26 know about like, let's go about like media tags or something. So you can see the stuff in HTML5. You

10:33 can see the stuff in when you say media, it looks like median. So you can see that in the statistics

10:38 module for Python, some stuff for CSS, or you could come over and say, look, I just want to search for

10:44 CSS. And then you get like using media queries and how to do that kind of stuff. So it's kind of a,

10:50 what you do is you turn on the pieces that are relevant to you, and then you can search across

10:54 those technologies. Cool, right? Wow. Yeah. And, and then if you're on the move, you can come over

11:01 here and turn on offline, offline data, and it'll download all of that as an app so that when you're at

11:09 the coffee shop or you're on a plane, you now have all the documentation for Python 3.10, Vue.js,

11:13 Werkzeug, NGINX, et cetera, et cetera, that you can use, which is pretty cool. And this is something

11:19 that drives me crazy about Firefox. They had it and they took it away. And I don't understand why,

11:24 because I feel like Firefox is all about the web. So they took away the ability to do progressive

11:29 web apps in Firefox, but all the Chromium browsers support it. So you can actually go and install this

11:35 as a dedicated application on your system. So you, if you have no web, you just click that open. It's

11:41 its own window. You can, you know, Alt-Tab, Command-Tab between it. Super easy. And then turn on the

11:47 offline mode. And you basically have an app that has offline documentation for all the programming

11:52 technologies that you care about. So this is my new coffee shop buddy.

11:56 Does the search go across the things you've selected then? So if I search for like replace or something,

12:01 it's the things I've selected.

12:03 Yeah. So if you turn on like JavaScript and Python, it would look for that in both languages.

12:08 Oh, okay.

12:08 Yeah. So basically the ones you turn on, there's a ton of them, right? And you pick,

12:12 you say, these are interesting to me and then search and stuff from what I can tell only applies

12:16 to the technologies you say you care about. Cause like if, if you don't use Java, you really don't

12:21 want to see the documentation for Java search, right? That would be useless.

12:23 Yeah. One of the things I like about this is it also has versions. So, if you're using a,

12:28 like an older version of Postgres, you can just enable that version.

12:33 Right. Sometimes it doesn't matter very much, but other times it matters massively, like Bootstrap

12:38 3 and Bootstrap 5, they're like fully incompatible basically. Like they're totally

12:42 different keywords and grid systems. And you don't want just the latest. If you've got an old app

12:47 you're working on something like that. Python's more forgiving about that kind of stuff, right?

12:51 It doesn't break as often.

12:52 I was amused that the list though is, it has like 3.9, 3.8 for Python

12:59 and it has 3.10 at the bottom because one is obviously.

13:03 Cause it's alphabetically sorted. How interesting. Ian, what do you think of this?

13:07 That's very cool. I'm amazed. Is somebody at DevDocs kind of manually maintaining all of the links

13:13 to these, like the original source documentation?

13:17 Yeah. Where are they getting it from? Right. I mean, cause there's, they're super disparate.

13:21 It's like matplotlib and markdown and MariaDB. These are all, it's unlikely they're all stored

13:26 in the same basic system. Right. I don't know how they get them actually.

13:29 Yeah. That's very cool. I mean, I know, I normally have solved the same problem by having like 130

13:34 tabs open to different bits of Python docs and pandas and.

13:38 Exactly. Exactly. Yeah. I'm pretty sure they got pandas in here.

13:42 They got numpy as its own thing that we saw matplotlib. There's pandas and there's even,

13:47 you know, versions of pandas across there.

13:50 Single tab solution. Brilliant.

13:53 Yeah. It looks, looks pretty good to me. All right. You want to tell us about what you got

13:57 for your first item?

13:58 Okay. Sure. Yeah. So, as I mentioned earlier, I own a package called MSTICPy.

14:06 And the first thing to sort out with it is the spelling, because I suffer from this on a daily basis,

14:10 mistyping it, even though I've owned it for like three or four years. So it's MSTIC, standing for

14:17 Microsoft Threat Intelligence Center. There's no Y or anything like that in there. So it's a tool

14:21 set for cybersecurity investigations and hunting in Python, mainly in Jupyter notebooks. So there are a

14:29 couple of questions to ask about that. So firstly, what is cybersecurity hunting and investigation, and

14:35 why are Jupyter notebooks useful? So the first one, cybersec investigation is really responding

14:41 to alerts or other kinds of threat intelligence and trawling through typically large amounts of

14:47 security logs from cloud services, hosts, account services to determine whether this is a real threat

14:53 or not. And there are two main kinds of... That's one of the huge problems, right? Is you've got all

14:57 these different systems. How are you going to know if someone, if you don't have a tool like this,

15:02 how are you going to know that something, someone's in there rooting around, right?

15:06 Yeah. Yeah. And there are a couple of things that usually trigger this kind of search. So one of them

15:11 is an alert, maybe coming from your SIEM, and that stands for security information and

15:17 event management. So a console like ArcSight is a traditional one, or Microsoft Sentinel

15:24 is a cloud-based one. So you get an alert based on a rule and, in a fairly managed process,

15:30 somebody needs to go and investigate. Is this a real threat or is this just noise? Or there might be something

15:35 like SolarWinds a year ago, or Log4j, like something in the press or something

15:42 from a threat intel kind of alert says this kind of threat is around, and that's a more ad hoc process, kind of hunting.

15:49 Like, do we see this in our organization? So that's kind of what MSTICPy is trying to, you know,

15:55 address the needs of. And the second question is why Jupyter notebooks? Why would you do it in a Jupyter

16:00 notebook rather than in your existing SOC tools? I mean, I think there's a lot in common, this kind of

16:08 activity has a lot in common with big science data, sorry, big data science. I mean, something like

16:15 astronomy where you're kind of, you know, hunting for an adversary activity is a little bit like trying to find an exoplanet

16:21 in kind of gigabytes of data or a new quasar or something like that.

16:25 a hundred thousand stars or a hundred thousand lines of log file and you're hunting for some patterns and stuff.

16:31 Right. And you've got a few photons you're trying to determine are these kind of different, you know,

16:35 something like, like, an adversary activity is a little bit like that. It's like millions and millions of events

16:39 and you're trying to find the bad stuff. So traditional SOC tools, you know, can be really excellent.

16:45 And I work with one that I think is really good, but they all have limitations.

16:50 What's a SOC tool? A SOC tool: SOC is security operations center. So something like, you

16:56 know, a console that fires alerts and tells you that they have a bunch of analysts, engineers looking

17:03 at this output of this and deciding, and that's the trigger for their investigations. They're like,

17:08 is it like a failed login on the SQL server?

17:10 Yeah. Something like that. Or, you know, it could be more sophisticated thing. Like,

17:15 something's exit, you know, tried to access the kind of password data on this, or looks like it's trying

17:20 to access the password data on this host or, or has made a weird kind of configuration change to, mailbox settings.

17:28 So all those kinds of things can kind of trigger alerts and investigations. but you are limited

17:34 in most kind of operation center environments. Notebooks allow you to kind of break out of some

17:39 of the constraints of that. So firstly, you can get data from anywhere. you're not just limited by

17:45 kind of what's in your logs. You could go to VirusTotal or so. You can bring data from anywhere.

17:50 you can use customized kind of analysis. so write your own or get, get things from PyPI. Lots

17:57 of people have kind of written this stuff. you control the workflow. So, so you don't have to follow

18:02 what the tool says. You can reorder things, you can backtrack, redo things, and the workflow is repeatable.

18:08 So if you get a similar kind of, you know, issue again, or similar kind of alert, you can

18:14 fish out an old notebook and rerun the same kind of analysis. And you end up with a nice kind of

18:19 shareable document that, it describes your investigation a bit like the results of a

18:25 scientific investigation. It's like, here are all the steps I took and these are the results.

18:29 And this is what they, this is what we determined to be the bad, you know, the bad activity.

18:33 Right. The other thing that seems useful here is Jupyter. Often the notebooks will save the last

18:40 bit of computed information. And then you can go, you know, change a cell, ask the question again,

18:45 change without rerunning the whole thing. And like that's parsing tons of logs or

18:49 pulling them over SSH or whatever that not doing that again is nice.

18:53 Yeah. And it's brilliant. If you don't like doing lots of queries in different browser tabs

18:57 and your browser crashes, they've all gone. What do you do?

19:01 It's all in a Jupyter notebook. I say, it's like second by second, after you do it,

19:06 you can just go back and you can go back to things like you may have done months ago.

19:09 So, yeah, absolutely.

19:11 Yeah. So, so when I started all of this, I kind of thought a lot of this stuff for cyber

19:16 investigations would be available on PyPI. I thought, great, Jupyter notebooks seem brilliant.

19:21 And there's going to be process tree viewer and there's going to be an event timeline and all this

19:25 kind of stuff. and I found out there wasn't, at least I couldn't find it. so I decided to

19:32 just like stop everything. Need to start writing this, this stuff. So it turns out that things like

19:37 visualizations you need for detecting exoplanets are a bit different from ones you need to detect,

19:42 bad actors. So we started building this thing, originally me, but there's now Pete

19:48 Bryan and Ashwin Patil also kind of working on it, two of my colleagues, and a bunch of people in the

19:55 community. It's got four main functional sections. It's like data querying, how you get data in,

20:01 how you do templated queries. Then there's enrichment. So for example, if you have something like an IP address,

20:07 you might have a bunch of questions about it as an analyst, like which geographical location is this IP

20:13 address from, or are there any malware reports about it. The third area is analysis, things like

20:19 anomaly identification, like the thing you were talking about, a spike in failed logon events,

20:25 an unusual spike in failed logon events, that kind of thing. The final area is visualizations,

20:29 and these are more specialized. I've got a couple of examples in the show notes.

20:34 This is like an anomaly identification pattern. This is one of the custom ones. We use Bokeh,

20:40 which is a really nice kind of visualization package, to allow you to kind

20:45 of view data in a way that an analyst kind of expects to see it. So there are more of this

20:51 kind of visualization than more traditional kind of graphs. I would much rather look at this than

20:56 log files or event logs or, or whatever, you know? Yeah. That's the whole thing about, you know, you,

21:00 you, you need, you may have thousands of events and you need to get down to the few that are the

21:04 interesting, the interesting thing. so one of the areas that we've, we try to focus on

21:10 currently, cause we wrote all this stuff and you have like hundreds of functions that you could use,

21:15 but it's kind of difficult to discover them. And they all, cause they evolved a little bit organically.

21:21 Like how do you, they were working a little bit of a different way, different set of parameters.

21:26 So the work we're currently doing is trying to make this all a bit more accessible. So all of the

21:31 functions that relate to say an IP address, all the questions you want to ask about it are kind

21:36 of dynamically attached to a class called IP address. So they're all like things like,

21:41 Oh, interesting. Do, do, do. So you don't have to work just with a raw string or just some raw IP representation, but you can ask it questions like its location.

21:49 Well, it's not quite that intelligent. It's even a bit less intelligent than Alexa, but, but it's,

21:55 but it's more like, you know, there might be things like geolocation of an IP address,

22:01 threat intel lookups, different queries that might be, have IP addresses like a, a parameter.

22:07 and previously you'd have to go and find all of these things and import them separately and run

22:12 them. But now they're all kind of dynamically attached as methods. The fact that they use IP address

22:17 as a parameter means that you just have one object to import, and then you can do all of these different

22:22 operations, on this single item. there's, there's some things that don't work with that.

22:27 Some things like the visualizations, for example, they're not IP address or host or account specific.

22:33 They work on big blocks of data. So the other area we're working on is that anything that takes a

22:39 bunch of data as an input, we're writing those as pandas accessors, so they appear as methods on a

22:46 DataFrame. So you do kind of df.mp_plot.timeline(), right? And it would produce your

22:53 timeline as long as it's the right kind of data. So yeah, that's one of the challenges of

22:57 writing this kind of thing organically is you end up with a lot of stuff, but nobody knows it's there

23:02 and nobody knows how to import it. So we try to make it as accessible as possible so that it just becomes a very

23:07 intuitive thing. Oh, I have an IP address. What functions can I do? I can do this, you know,

23:12 it's all like tab completable, that kind of thing.
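
The accessor idea described above can be sketched with pandas' public extension API. This is a minimal illustration, not MSTICPy's code: the accessor name demo_plot, the method timeline_counts, and the sample columns are all hypothetical, and the method just counts events per day instead of drawing a timeline.

```python
import pandas as pd


# "demo_plot" is a hypothetical accessor name used only for this sketch.
@pd.api.extensions.register_dataframe_accessor("demo_plot")
class DemoPlotAccessor:
    """Toy accessor: its methods show up on every DataFrame as df.demo_plot.<method>."""

    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._df = pandas_obj

    def timeline_counts(self, time_column: str) -> pd.Series:
        """Count events per day (a stand-in for a real timeline plot)."""
        times = pd.to_datetime(self._df[time_column])
        return times.dt.floor("D").value_counts().sort_index()


# Made-up sample data for illustration.
events = pd.DataFrame(
    {
        "TimeGenerated": ["2022-03-01 10:00", "2022-03-01 17:30", "2022-03-02 09:15"],
        "Account": ["alice", "bob", "alice"],
    }
)

counts = events.demo_plot.timeline_counts("TimeGenerated")
print(counts)
```

Because the accessor is registered on the DataFrame type, it is tab-completable on any DataFrame once the module defining it has been imported, which is the discoverability benefit being discussed.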

23:14 Yeah, I think it's really cool. You've taken this Python data stack view of cyber security and threat

23:21 detection. Yeah.

23:22 Yeah. Brian, what do you think?

23:23 Well, it's definitely a complicated area. One of the things I like about this

23:29 story is just talking about the complexities in API design and discoverability. That

23:36 applies to lots of different fields, but yeah.

23:39 Yeah. It's one of those things you should have thought about at the beginning, but,

23:42 even at the end, you can tidy things up. Yeah. So, famous last words.

23:49 So yeah, we're definitely open for like other people collaborating, contributing stuff,

23:55 cause there's a lot of ground to cover.

23:56 Yeah, for sure. It's on GitHub, I saw. One final question before we move on: is it just for Azure

24:04 or is this a thing that more broadly works across different systems?

24:08 No, I think I should have mentioned that a little bit earlier on. We originally built

24:11 it for Microsoft Sentinel notebooks, but it supports things like Splunk and Defender, and we're

24:16 working on an Elastic provider. So really, with anything you can get into a pandas DataFrame,

24:21 you can use most of the functionality. So even if we don't have a provider ourselves,

24:26 if you've got something like PySpark and you can get a DataFrame, then all of our functions take

24:32 a DataFrame. You know, we use pandas as our universal data interchange format.

24:37 Yeah, indeed. Indeed. Kim van Wyk out in the audience likes it: it's a much nicer way

24:44 to glean info than logs and complex grep. I'm right there with you. All right. Now, before we

24:49 move on, Brian, let me tell you about our sponsor for this episode. This episode of Python Bytes is

24:55 brought to you by FusionAuth. FusionAuth is an authentication and authorization platform built by

25:01 devs for devs. It solves the problem of building essential user security without adding risk or

25:07 distracting from the primary application. FusionAuth has all the features you need with great support and

25:12 a price that won't break the bank. And you can either self-host it or get the fully managed solution

25:18 hosted in any AWS region. Do you have a side project that needs custom login and registration,

25:23 multi-factor authentication, social logins, or user management? Download FusionAuth Community

25:29 Edition for free. The best part is you get unlimited users and there's no credit card or subscription

25:35 required. Learn more and get started at pythonbytes.fm/fusionauth. The link's in your show notes.

25:41 Thank you to FusionAuth for supporting the show. All right. What do you got for your next one,

25:45 Brian? Number, numbers, something every computer scientist should know?

25:49 Yes. Floating point arithmetic is complicated. And so when I started working

25:56 professionally, one of the things I was recommended to read was an article called What Every

26:01 Computer Scientist Should Know About Floating-Point Arithmetic. And don't worry, it's only like a

26:06 really long paper with lots of math. So I am not telling you to read this, although it is an

26:13 interesting read. What I would like you to read is this article by David Amos called The Right Way to

26:19 Compare Floats in Python, because there's a few things that we need to know about floats when we're

26:24 using them, and he covers all of this in the article without going through

26:30 tons of scary math. The gist is that floating point numbers have to be represented in a way that the computer

26:37 can store them and use them and manipulate them, even though some numbers are huge and won't fit

26:43 normally. So we have to do things like accept that there's error and rounding. So there's a little bit

26:49 of a discussion there that he talks about. One of the things that surprises people sometimes when they

26:54 first come into Python, but it's not just Python, it's most languages, is that somewhere

27:00 there's going to be something obvious that doesn't work, like in David's example: 0.1

27:06 + 0.2 == 0.3. And that will show up as False, because they're not equal.

27:14 And this is weird. It's obviously crazy that that doesn't work, but it's not just equals.

27:21 You can also do comparisons like, you know, less than or greater than. So not only

27:27 are they not equal, 0.1 plus 0.2 is not even less than or equal to 0.3. It's weird.

27:34 So what do you do? The gist of it is don't compare things with the normal

27:42 math comparisons if there's floating point involved. So what you want to do instead is, and

27:49 here's a little tiny bit of math, way less than the, than the example, the thesis,

27:54 the dissertation. Yeah. So there's a whole bunch of stuff built into Python

27:59 to work with comparisons. And one of the most common ones, I'm trying to get there,

28:05 is math.isclose. So there's the math library with an isclose function

28:11 that's used to just say, hey, I've got two values. Are these close enough? And

28:18 if you have to compare floats, something like this is great. And

28:25 behind the scenes, what it does is it's taking the two values and

28:29 subtracting them and figuring out if the absolute value of the delta is below some

28:36 tolerance, some reasonable tolerance, like close enough. And what that tolerance is

28:41 is either a relative or absolute tolerance. And most of the time you can kind of get away

28:47 with not caring about that, but if you do care about it, you can control that: you can pass in

28:52 what tolerance you expect things to be within. I use stuff like this all the time with

28:57 test equipment, because I definitely want control over the tolerance levels.
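
A short sketch of the comparison problem and the math.isclose fix described above. The helper is_close is a hypothetical name; it only restates the formula given in the math.isclose documentation for finite values and is not the standard library's source.

```python
import math

total = 0.1 + 0.2

print(total == 0.3)              # False: exact equality fails
print(total <= 0.3)              # False: ordering comparisons are affected too
print(math.isclose(total, 0.3))  # True: within the default relative tolerance


def is_close(a: float, b: float, rel_tol: float = 1e-09, abs_tol: float = 0.0) -> bool:
    """Illustrative restatement of the documented isclose check for finite floats."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


# A relative tolerance alone can never match zero, so pass abs_tol near zero.
near_zero = math.isclose(1e-12, 0.0)
near_zero_abs = math.isclose(1e-12, 0.0, abs_tol=1e-9)
print(near_zero, near_zero_abs)
```

The last two lines show why it can matter which tolerance you control: comparisons against values at or near zero need an absolute tolerance.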

29:02 So, yeah, for sure. So there's math.isclose, but then there's also, I'm not going to

29:08 scroll all the way down here, but he also covers NumPy. So NumPy has got a

29:13 couple of these that are really great. One of them is isclose also, but it works on arrays and

29:19 it'll give you an array of True and False values. But you can also use allclose,

29:25 which just says you've got two arrays, and it checks whether all of the pairs are close enough

29:30 when it matches those up. Also covered, which we use during testing a lot, is pytest.approx,

29:37 which is a little bit of a different beast, but David covers that. So, basically this

29:43 is a semi-regular reminder to anybody using floating point math in Python that you should be careful

29:49 with it, or any other language. So. Yeah. It's not a Python thing. It's just representing

29:55 things that don't fit. Now there's some things sometimes where you have to be very exact.

29:59 You need to be very precise. And in those cases, Python does have the Decimal and Fraction types,

30:05 and David covers these in the article, which are cool. They're cool things to know about,

30:10 like definitely around people using money or other very high precision. So

30:17 those are covered. You do take some sort of a hit for those. But if you really care about

30:23 the precision and want to do things exactly right, then you probably should read that

30:29 larger article, because there's things that you have to do, like certain operations before

30:34 other operations, to try to keep the error from accumulating too high. So it gets messy.
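
For the exact types mentioned above, a minimal sketch using only the standard library. The variable names are just for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Build Decimals from strings so the decimal digits are taken literally.
decimal_sum = Decimal("0.1") + Decimal("0.2")
exact_decimal = decimal_sum == Decimal("0.3")

# Fractions store an exact numerator and denominator.
fraction_sum = Fraction(1, 10) + Fraction(2, 10)
exact_fraction = fraction_sum == Fraction(3, 10)

# Building a Decimal from a float carries the float's binary error along with it.
from_float_matches = Decimal(0.1) == Decimal("0.1")

print(exact_decimal, exact_fraction, from_float_matches)
```

The last comparison is the usual trap: the error is already in the float before Decimal ever sees it, so money-style values are best created from strings or integers.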

30:39 Interesting.

30:40 I think I'm fundamentally disturbed by the idea that zero isn't zero. So my approach to floating

30:45 point numbers is normally convert them to ints. Yeah. I was thinking that, yeah, sometimes that is

30:53 the way to do it. Right. I was thinking this kind of stuff maybe applies a lot to the project that

30:59 you're working on. If you're trying to come up with ratios that represent, you know, how risky something

31:05 is and things like that. Yeah. Yeah. Yeah. I mean, certainly a lot of, yeah, I was being a bit

31:10 flippant before. It's just fun. It's like, I'm very Platonic at heart, I think. So like zero or

31:17 one should be zero or one, not nearly one or nearly zero. There should be a perfect square and a perfect

31:22 circle. Like how can they not exist in our language? Is it really zero or negative zero?

31:27 Henry in the audience also points out that pytest.approx also works on NumPy arrays as

31:37 well. Nice.
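
The array comparisons mentioned in this segment can be sketched with NumPy's isclose and allclose, assuming NumPy is installed. The sample values are made up, and pytest.approx is left out so the example needs only one third-party package.

```python
import numpy as np

computed = np.array([0.1 + 0.2, 0.2 + 0.4, 1.0])
expected = np.array([0.3, 0.6, 1.5])

# Element-wise exact equality trips over rounding error in the first two pairs.
exact = computed == expected

# isclose gives one True/False per pair; allclose reduces that to a single answer.
close_mask = np.isclose(computed, expected)
all_close = np.allclose(computed, expected)
first_two_close = np.allclose(computed[:2], expected[:2])

print(exact, close_mask, all_close, first_two_close)
```

Both functions accept rtol and atol keyword arguments when the default tolerances are not appropriate.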

31:38 Which is pretty cool.

31:39 Cool.

31:39 You can put that all together. All right. Let me tell you all about pypyr. I think that

31:45 might be the way you pronounce it. Everything needs its own description,

31:49 its own little phonetic bit. So this is a simple way to create scripts that run and do

31:57 stuff on your computer using Python. And what's cool about it is it has a real simple way to define

32:02 the steps. Some of those steps can be optional, but then you can also piece together things like

32:07 other programming. So you can combine commands, different scripts in different languages and

32:13 applications all into one sequence of events that happens on your computer. So it's basically a task

32:20 runner where you define stuff in YAML. And probably the best way to see is to go check out the docs. And

32:25 there's a whole bunch of docs. The docs are really nice here actually. So for example,

32:30 if you go to getting started and come down here to run your first pipeline, I really like the way

32:35 the docs here look. But the way you define it, here's like a one-step one: you

32:41 just say the steps, and it's all YAML, and give a step a name so you can refer to it. And then you have inputs

32:47 and outputs, and you do the little curly string interpolation types of things. Or you can

32:51 have more complex ones, like with different steps, and you can even have little comments. There's a way to

32:57 put a comment in your YAML file as well. So there's also conditional. Let's see if I can find a good

33:03 conditional one down here. Here's one that works with, like, this one is just an echo

33:09 statement and the ping command, but you know, whatever you want to do, you can basically

33:14 pass command line arguments to the YAML file or to the workflow, the pipeline, and it'll take those and

33:21 feed them into the steps. So for example, when you call it, you can say like count equals one and IP

33:27 equals that. And those will become the little string-interpolated pieces that go in there. So you can

33:32 just combine whatever, basically whatever commands are available to the shell, right? Be that Python or

33:38 POSIX or windows or PowerShell or whatever you're looking to do. Pretty cool, huh?

33:42 Hmm. That's pretty neat. I might need this for my, my job of, automating my show notes.

33:49 I might use some of this.

33:50 Oh yeah, there you go. If you can find this, go do that. And so on, like, here's one

33:55 that sort of uses the truthiness. So it says there's a bunch of different steps and the,

33:59 you can use the run flag. So here it says run if there's a value for a on this one. And this one

34:06 says run if there's a value for B. And then there's an example where it says, okay, we run it by itself.

34:10 Those don't run. But if you pass a, then it runs that a step. If you pass B, it does the B step,

34:15 or it can do both if you pass them both. And I like the simplicity of it. Like a lot of these tools,

34:20 like this feel like they're pretty complicated.

34:22 You know, you're sort of like your example with Gensim, Brian, where you're like,

34:26 is this thing too heavy weight for what I'm trying to ask it to do? You know? And this seems like a

34:31 real simple thing. And I don't have to learn about make or any of those kinds of things.
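
To make the two ideas in this segment concrete (curly-brace interpolation of arguments and steps that run only when a value was passed), here is a toy runner in plain Python. It is not pypyr's code, API, or YAML format; run_pipeline, the step dictionaries, and the "echo" and "run" keys are all hypothetical names invented for this sketch.

```python
def run_pipeline(steps: list[dict], context: dict[str, str]) -> list[str]:
    """Run steps in order, skipping any step whose 'run' key names a missing or empty value."""
    output = []
    for step in steps:
        run_key = step.get("run")
        if run_key is not None and not context.get(run_key):
            continue  # truthiness check: no value was passed, so skip this step
        # Fill {placeholders} in the step text from the arguments that were passed in.
        output.append(step["echo"].format(**context))
    return output


steps = [
    {"echo": "begin"},
    {"echo": "A is {a}", "run": "a"},
    {"echo": "B is {b}", "run": "b"},
    {"echo": "end"},
]

print(run_pipeline(steps, {}))
print(run_pipeline(steps, {"a": "1"}))
print(run_pipeline(steps, {"a": "1", "b": "2"}))
```

Running it with no arguments skips both optional steps, passing a runs only the A step, and passing both runs everything, which mirrors the behavior described for the documentation example.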

34:34 Yeah. GitHub Actions or, yeah. Yeah. Yeah. It's got a bit of a GitHub Actions feel to it.

34:40 But it seems like a nicer kind of declarative. That's really cool.

34:45 Indeed. Yeah. If you were not into programming, or you didn't want your steps to be programming.

34:49 But of course, at each step you could call a Python app or script that's going to do

34:55 something complicated, right, if it needs to. The orchestration of that

34:59 you don't have to make complicated. Is it just a command-line tool? Or can you invoke it from Python?

35:03 Might be a bit interesting. I'm sure there's a way to import it and make it do a thing. You

35:09 know, it's probably just a Python package with an entry point in this package. So I would think so.

35:14 Yeah. Cause it would be nice to be able to do that rather than just using subprocess to invoke a lot

35:18 of things. Like if you're in. Oh, interesting. I hadn't really thought about it as a replacement for

35:23 subprocess, but yeah, because a lot of times when you're trying to orchestrate stuff, like it talks

35:28 about here being part of the shell or being another app or another language, you would just use subprocess

35:34 on it. Right. Yeah. Cool. Well, there it is. pypyr, pypyr.io, and people can check that out. It looks,

35:40 looks pretty interesting. Nice. All right. Ian, you want to take us out with your final item here?

35:44 Ah, Pygments. Okay. So this is a package. I mean, if you are a developer, there's a very good chance

35:49 that you have been using this for years, like me, without knowing about it. You might have

35:54 seen it being installed as like a dependency. It's like, what is that thing? That was my thought,

35:58 Ian. I'm like, I know I see this all the time in my dependencies and I just never really bothered to

36:03 look into what it does. Yeah. So I hadn't until recently. So if you use Jupyter Notebook

36:09 markdown, you know, you can put like three backticks and then a block of code. And you can actually

36:16 put like Python or bash or something as a language, and it will intelligently highlight it. So the thing that's

36:23 doing that intelligent highlighting is Pygments. GitHub markdown, same kind of thing, although I'm not

36:28 sure whether GitHub uses Pygments. And if you do developer docs, like Read the Docs and Sphinx,

36:34 that also uses Pygments to kind of color code your code samples. And I know there's a lot of,

36:41 you know, writing kind of blog posts and stuff like that. There are

36:45 quite a few services out there where you can take a chunk of code and it will intelligently

36:50 highlight it and give you a JPEG or a PNG back. And that's kind of nice, but then you can't copy

36:56 and paste the code from those samples. So I don't like that really. I think if you're going to put

37:00 code in an article, you probably intend for people to be able to copy and paste it.

37:05 Yeah. That's the most likely thing you are to copy and paste.

37:08 Yeah.

37:08 Yeah. Right. Cause you want that code over here.

37:10 Yeah. You don't want an image of your, I mean, cause you could use OCR to like reinterpret it,

37:13 but it's all, yeah. And then maybe Brian's Gensim to like tidy it up.

37:19 But, so with Pygments, you can use it as a standalone package and it can do this kind of

37:27 rendering, and it can render to like HTML with CSS style sheets for all of the color coding. It also

37:33 renders to like ANSI terminal, LaTeX, a few other kinds of things. So if you're using,

37:40 you know, if you want to get a nicely formatted piece of code in a document or you're doing

37:45 developer docs, it's certainly kind of useful. I mean, I came across it, or should I just say

37:50 one thing: it also supports, maybe I can just switch, supports lots and lots of languages. So it's

37:55 very simple to use. It has a highlight function, and then you import a lexer, which is like the

38:02 thing that understands the tokens in a language, and a formatter for the output type you want.

38:08 And I think there's hundreds of these things. So there are a lot of languages in there.
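
The highlight, lexer, and formatter pieces described above fit together roughly like this, assuming Pygments is installed. The code string is made up for illustration; the imported names are part of Pygments' public API.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter, TerminalFormatter
from pygments.lexers import PythonLexer, get_lexer_by_name

code = 'print("hello")\n'

# Lexer: understands the tokens of one language. Formatter: picks the output type.
html = highlight(code, PythonLexer(), HtmlFormatter())
ansi = highlight(code, PythonLexer(), TerminalFormatter())

# Lexers can also be looked up by a short alias rather than imported by class name.
powershell_lexer = get_lexer_by_name("powershell")

print(html)
print(ansi)
print(powershell_lexer.name)
```

The HTML output only carries CSS class names, so a page also needs the matching style sheet, which the formatter can generate with get_style_defs.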

38:12 No kidding.

38:13 More than half of these I've never heard of. And it also supports, as well as things like,

38:17 you know, you'd expect Python, it supports Python tracebacks. So it has a separate lexer for color

38:22 coding tracebacks. All the usual languages you'd expect, but also some things like data formats,

38:28 like TOML, JSON, XML. Okay. Interesting. Like a lot of the files that we might run across.

38:36 Yeah.

38:37 Yeah.

38:37 Yeah. And so it's very, very easy to use. And the reason I came across it

38:44 recently is, so a lot of attacker code tends to be deliberately obfuscated. So it's kind of Base64

38:51 encoded, but then even once you decode it, it's kind of munged in a way to make it as unreadable as

38:57 possible. So one of the things that we try to do is pull that code back, like decode it, try to

39:03 clean it, deobfuscate it. But if you can present it as close to the way a

39:09 developer would write it as possible, it makes it much quicker for an analyst to determine

39:14 what this is doing. So we've used it now in MSTICPy to kind of color display things like,

39:21 well, it's just PowerShell script or bash or something like that. So that's how I came across it.

39:26 Actually, rather than just seeing it go past as part of a pip install, I actually had to invoke it

39:32 directly. So a big shout out to the developers and maintainers of Pygments.

39:38 It's one of those packages that probably millions of people benefit from, but very few people

39:43 kind of know about it, and it's just super easy to use. They seem to be adding

39:48 lexers all the time. So, great. Yeah, this is amazing. I didn't realize that it did all

39:54 of this. This is way more advanced than I thought. Brian, did you know? No, I just thought it was

40:00 something that magically did syntax highlighting. So I didn't have to care about it.

40:04 Yeah, exactly. I've got a little example in the show notes as well. I posted it; it has a dark theme.

40:13 Yeah. Yeah. And you probably want to include this nobackground=True

40:18 if using Jupyter notebooks, because if you select a theme, it just flips the whole notebook's

40:23 CSS theme. So that tells it just not to mess with what's in the background. Okay. Yeah,

40:29 that looks great. Yeah. Thanks. Thanks for pointing out how useful that can be. That's, that's cool.

40:34 Like I said, I've seen it go by all the time. I just never really paid that much attention to it.

40:39 It's probably a pretty minority use, but like if you need it, it's great.
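
The nobackground option mentioned a moment ago belongs to Pygments' HtmlFormatter. A small sketch of its effect on the generated style sheet, assuming Pygments is installed; the monokai style is just an example of a dark theme and is not necessarily the one used in the show notes.

```python
from pygments.formatters import HtmlFormatter

# Default: the style sheet includes a background rule for the wrapping element.
default_css = HtmlFormatter(style="monokai").get_style_defs(".highlight")

# nobackground=True leaves that wrapper background rule out of the style sheet.
notebook_css = HtmlFormatter(style="monokai", nobackground=True).get_style_defs(".highlight")

print(len(default_css), len(notebook_css))
```

Token colors are still emitted either way; only the rule that paints the surrounding block's background is dropped.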

40:42 Yeah. It's incredibly powerful. Fantastic. Well, that's all of our main items. Brian,

40:46 you got any extras? Just one extra, actually. One of the things when I was doing that

40:51 first topic with Gensim: it doesn't have very many dependencies,

40:57 but one of the dependencies is this library called smart_open. And I'm like, what? I

41:03 open things and I want to be smart about it. So I wanted to check this out and it's pretty neat.

41:09 I don't know if we've covered this before, but it basically mimics the interface of open,

41:15 normal Python open, but you can pass it really anything. And it does, like,

41:22 transparent on-the-fly reading of things, efficient streaming of large files from like S3 or Azure

41:29 or over the web.

41:31 Even straight just HTTP. Yeah. If you just have a link to a large file on a web server.

41:35 Yeah. And then just the code for it is just super nice. You know, you import open

41:41 from smart_open and you've got like for line in open this thing, and you can just work with each

41:49 line there. It's pretty cool.
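
A minimal sketch of that "for line in open" pattern, assuming the third-party smart_open package is installed. To stay runnable offline it reads a local gzip file written to a temporary directory instead of an S3 or HTTP URL, and the import is aliased (an illustrative choice) so the built-in open is not shadowed.

```python
import gzip
import os
import tempfile

from smart_open import open as smart_open_file  # third-party: pip install smart_open

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "events.log.gz")

    # Write a small compressed file with the standard library.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("first line\nsecond line\n")

    # smart_open infers the compression from the .gz extension and yields text lines.
    with smart_open_file(path, encoding="utf-8") as reader:
        lines = [line.rstrip("\n") for line in reader]

print(lines)
```

The same loop shape applies to remote URIs, which need the relevant extras and credentials and are therefore not shown here.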

41:51 I love it. That's a, that's a great one. Very nice. Ian, you got any extras you want to

41:56 shout out while we're here? I don't, I'm afraid.

41:59 I have two real quick ones to just quickly talk about. Last time,

42:07 Emily Morehouse spoke about using autosquash, which was really cool. So Adam,

42:14 let me get the attribution correct here, Adam Parkin sent in a follow-up to say,

42:20 hey, you should check out this article over here called Fixing commits with

42:25 git commit --fixup and git rebase --autosquash.

42:29 Woo. The long and the short of it is it talks about doing a lot of the things that Emily said were pretty

42:34 cool, but in the end setting up your .gitconfig with autosquash = true, and then adding an alias

42:42 so you can just type git fixup. And when you type that, it actually does git log and shows

42:47 the last 50 items and then allows you to go back and work with those. And basically it's just a real

42:54 quick way to get back into the scenario where you mark different elements for fixup. So people can

43:00 check that out if they were following Emily's advice, but they want it to be like one line they

43:05 don't have to remember. There you go. That's cool. And then Python 3.10.3 is out as of about a week

43:11 ago, I suppose. So there are many changes amongst here. You know, I would love, there's like so many

43:17 great changes here. I don't know how many do you think that is probably a hundred, maybe a little

43:22 bit less. It would be great if there was like a, these are critically important at the front. Like

43:28 there's a security problem that was fixed, or there's a thing we've taken out is no longer here.

43:33 They're kind of all the same priority. But nonetheless, there's a bunch of changes that

43:37 people can check out and upgrade to the newer version of Python 3.10.

43:41 Different people care about different stuff though. I know. I don't want to impose my importance on

43:47 other people's importance. Yeah. So it's funny, when I first came across Python,

43:52 I was kind of like, why is it so slow between the major versions coming out? But then suddenly,

43:57 as a Python developer, it's like, why are the versions coming out so quickly?

44:00 Yeah. It's definitely true. There's a ton of change. This is just, you know, some minor version

44:08 change that has these, all these changes in here, which is pretty cool.

44:11 Well, we also used to be on an 18-month cycle and now we're on a yearly cycle. So yeah.

44:16 Yeah. It's Łukasz Langa's fault that we are 50% faster now. Thanks, Łukasz. All right. How about a joke to

44:23 close out the show? That'd be great.

44:24 Yeah. So here's a good tweet and it's this sort of perplexed, I think in a good way,

44:32 character wearing all these, are these prizes? I don't know. Anyway, Python developers, when someone

44:38 asks what their secret is, and this person just says, I just keep writing pseudocode and it just keeps

44:44 working. It's a little bit like that joke where they have some code, pseudocode in a text file.

44:50 They're like, just rename it to .py and try to run and see what happens. Anyway, that's the joke.

44:55 Nice.

44:56 Thank you, Brian, as always. And Ian, thanks for being part of the show.

44:59 Thank you. Great to have you here.

45:00 Thank you very much both. It's been a real pleasure.

45:02 Yeah, it sure has. See y'all.
