Brought to you by Michael and Brian - take a Talk Python course or get Brian's pytest book

Episode #276: Tracking cyber intruders with Jupyter and Python

Published Wed, Mar 23, 2022, recorded Tue, Mar 22, 2022.

Watch the live stream:

About the show

Sponsored by FusionAuth:

Special guest: Ian Hellen

Brian #1: gensim.parsing.preprocessing

  • Problem I’m working on
    • Turn a blog title into a possible url
      • example: “Twisted and Testing Event Driven / Asynchronous Applications - Glyph”
      • would like, perhaps: “twisted-testing-event-driven-asynchrounous-applications”
  • Sub-problem: remove stop words ← this is the hard part
  • I started with an article called Removing Stop Words from Strings in Python
    • It covered how to do this with NLTK, Gensim, and SpaCy
    • I was most successful with remove_stopwords() from Gensim
      • from gensim.parsing.preprocessing import remove_stopwords
      • It’s part of a gensim.parsing.preprocessing package
  • I wonder what’s all in there?
    • a treasure trove
    • gensim.parsing.preprocessing.preprocess_string is one
    • this function applies filters to a string, with the defaults almost being just what I want:
      • strip_tags()
      • strip_punctuation()
      • strip_multiple_whitespaces()
      • strip_numeric()
      • remove_stopwords()
      • strip_short()
      • stem_text() ← I think I want everything except this
        • this one turns “Twisted” into “Twist”, not good.
  • There’s lots of other text processing goodies in there also.
  • Oh, yeah, and Gensim is also cool.
    • topic modeling for training semantic NLP models
  • So, I think I found a really big hammer for my little problem.
    • But I’m good with that

Michael #2: DevDocs

  • via Loic Thomson
  • Gather and search a bunch of technology docs together at once
  • For example: Python + Flask + JavaScript + Vue + CSS
  • Has an offline mode for laptops / tablets
  • Installs as a PWA (sadly not on Firefox)

Ian #3: MSTICPy

  • MSTICPy is toolset for CyberSecurity investigations and hunting in Jupyter notebooks.
  • What is CyberSec hunting/investigating? - responding to security alerts and threat intelligence reports, trawling through security logs from cloud services and hosts to determine if it’s a real threat or not.
  • Why Jupyter notebooks?
    • SOC (Security Ops Center) tools can be excellent but all have limitations
    • You can get data from anywhere
    • Use custom analysis and visualizations
    • Control the workflow…. workflow is repeatable
  • Open source pkg - created originally to support MS Sentinel Notebooks but now supports lots of providers. When I start this 3+ yrs ago I thought a lot this would be in PyPI - but no 😞
  • MSTICPy has 4 main functional areas:
    • Data querying - import log data (Sentinel, Splunk, MS Defender, others…working on Elastic Search)
    • Enrichment - is this IP Address or domain known to be malicious?
    • Analysis - extract more info from data, identify anomalies (simple example - spike in logon failures)
    • Visualization - more specialized than traditional graphs - timelines, process trees.
  • All components use pandas, Bokeh for visualizations
  • Current focus on usability, discovery of functionality and being able to chain
  • Always looking for collaborators and contributors - code, docs, queries, critiques

Time series analysis for identifying anomalies

Process tree visualizer

Threat intelligence browser

Brian #4: The Right Way To Compare Floats in Python

  • David Amos
  • Definitely an easier read than the classic What Every Computer Scientist Should Know About Floating-Point Arithmetic
    • What many of us remember
      • floating point numbers aren’t exact due to representation limitations and rounding error,
      • errors can accumulate
      • comparison is tricky
  • Be careful when comparing floating point numbers, even simple comparisons, like: >>> 0.1 + 0.2 == 0.3 False >>> 0.1 + 0.2 <= 0.3 False
  • David has a short but nice introduction to the problems of representation and rounding.
  • Three reasons for rounding
    • more significant digits than floating point allows
    • irrational numbers
    • rational but non-terminating
  • So how do you compare:
    • math.isclose()
      • be aware of rel_tol and abs_tol and when to use each.
    • numpy.allclose(), returns a boolean comparing two arrays
    • numpy.isclose(), returns an array of booleans
    • pytest.approx(), used a bit differently
      • 0.1 + 0.2 == pytest.approx(0.3)
      • Also allows rel and abs comparisons
  • Discussion of Decimal and Fraction types
    • And the memory and speed hit you take on when using them.

Michael #5: Pypyr

  • Task runner for automation pipelines
  • For when your shell scripts get out of hand. Less tricky than makefile.
  • Script sequential task workflow steps in yaml
  • Conditional execution, loops, error handling & retries
  • Have a look at the getting started.

Ian #6: Pygments

  • Python package that’s useful for anyone who wants to display code
    • Jupyter notebook Markdown and GitHub markdown let you display code with syntax highlighting. (Jupyter uses Pygments behind the scenes to do this.)
    • There are tools that convert code to image format (PNG, JPG, etc) but you lose the ability to copy/paste the code
  • Pygments can intelligently render syntax-highlighted code to HTML (and other formats)
  • Applications:
    • Documentation (used by Sphinx/ReadtheDocs) - render code to HTML + CSS
    • Displaying code snippets dynamically in readable form
  • Lots (maybe 100s) of code lexers - Python (code, traceback), Bash, C, JS, CSS, HTML, also config and data formats like TOML, JSON, XML
  • Easy to use - 3 lines of code - example:
from IPython.display import display, HTML
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

code = """
def print_hello(who="World"):
    message = f"Hello {who}"
    highlight(code, PythonLexer(), HtmlFormatter(full=True, nobackground=True))
# use HtmlFormatter(style="stata-dark", full=True, nobackground=True)
# for dark themes

  • Output to HTML, Latex, image formats.
  • We use it in MSTICPy for displaying scripts used in attacks. Example:



  • smart-open
    • one of the 3 Gensim dependencies
    • It’s for streaming large files, from really anywhere, and looks just like Python’s open().


Joke: What’s your secret?

Want to go deeper? Check our projects