Brought to you by Michael and Brian - take a Talk Python course or get Brian's pytest book

#220: What, why, and where of friendly errors in Python

Published Thu, Feb 11, 2021, recorded Thu, Feb 11, 2021

Sponsored by Datadog: pythonbytes.fm/datadog

Special guest: Hannah Stepanek

Watch this episode on YouTube
Watch the live stream replay

Michael #1: We Downloaded 10,000,000 Jupyter Notebooks From Github – This Is What We Learned

  • by Alena Guzharina from JetBrains
  • Used the hundreds of thousands of publicly accessible repos on GitHub to learn more about the current state of data science. I think it’s inspired by work showcased here on Talk Python.
  • 2 years ago there were 1,230,000 Jupyter Notebooks published on GitHub. By October 2020 this number had grown 8 times, and we were able to download 9,720,000 notebooks.
  • Despite the rapid growth in popularity of R and Julia in recent years, Python still remains the most commonly used language for writing code in Jupyter Notebooks by an enormous margin.
  • Python 2 went from 53% → 11% in the last two years.
  • Interesting graphs about package usage
  • Not all notebooks are storytelling with code: 50% of notebooks contain fewer than 4 Markdown cells and more than 66 code cells.
  • Although there are some outliers, like notebooks with more than 25,000 lines of code, 95% of the notebooks contain fewer than 465 lines of code.

Brian #2: pytest-pythonpath

  • plugin for adding to the PYTHONPATH from the pytest.ini file before tests run
  • Mentioned briefly in episode 62 as a temporary stopgap until you set up a proper package install for your code. (cringing at my arrogance).
  • Lots of projects are NOT packages. For example, applications.
  • I’ve been working with more and more people to get started on testing and the first thing that often comes up is “My tests can’t see my code. Please fix.”
  • Example
    • proj/src/stuff_you_want_to_test.py
    • proj/tests/test_code.py
    • You can’t import stuff_you_want_to_test.py from the proj/tests directory by default.
  • The more I look at the problem, the more I appreciate the simplicity of pytest-pythonpath
  • pytest-pythonpath does one thing I really care about:

    • Add this to a pytest.ini file at the proj level:
          [pytest] 
          python_paths = src
      
  • That’s it. That’s all you have to do to fix the above problem.

  • Paths are relative to the directory that pytest.ini is in, which should be a parent or grandparent of the tests directory.
  • I really can’t think of a simpler way for people to get around this problem.
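  • Conceptually, the setting amounts to putting each listed directory on sys.path before the tests are collected. Below is a minimal sketch of that idea, not the plugin's actual code; the helper name add_paths_relative_to and its arguments are hypothetical.

```python
import sys
from pathlib import Path


def add_paths_relative_to(ini_dir, relative_paths):
    """Prepend each path (resolved against ini_dir) to sys.path, skipping duplicates.

    Hypothetical helper for illustration only; it is not part of pytest-pythonpath.
    """
    added = []
    for rel in relative_paths:
        full = str((Path(ini_dir) / rel).resolve())
        if full not in sys.path:
            sys.path.insert(0, full)
            added.append(full)
    return added


# Roughly what `python_paths = src` would mean for a pytest.ini in proj/:
# add_paths_relative_to("proj", ["src"])
```

  • Once proj/src is on sys.path, `import stuff_you_want_to_test` resolves from proj/tests/test_code.py.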

Hannah #3: Thinking in Pandas

  • Pandas dependency hierarchy (simplified):
    • Pandas -> NumPy -> BLAS (Basic Linear Algebra Subprograms)
  • Languages:
    • Python -> C -> Assembly
        df["C"] = df["A"] + df["B"]
    
        A = [ 1
              4
              2
              0 ]
        B = [ 3
              2
              5
              1 ]
        C = [ 1 + 3
              4 + 2
              2 + 5
              0 + 1 ]
    
  • Pandas tries to get the best performance by running operations in parallel.

  • You might think we could speed up this addition by giving each row its own thread, like this:

        Thread 1: 1 + 3
        Thread 2: 4 + 2
        Thread 3: 2 + 5
        Thread 4: 0 + 1
    
  • However, the GIL (Global Interpreter Lock) prevents us from achieving the performance improvement we are hoping for.

  • Below is an example of a common threading problem and how a lock solves that problem.
        Thread 1                  total                    Thread 2
         1 + 3 + 4 + 2              0                       0 + 5
         10                         0                       + 6 + 2
         total += 10                0                       13
         total =10                  0                       total += 13
                                    10                      total = 13
                                    13
    
        Thread 1                  total                    Thread 2
         1 + 3 + 4 + 2              0 unlocked              0 + 5 
         10                         0 unlocked              + 6 + 2           
         total += 10                0 locked                13
         total =10                  0 locked            
                                    10 unlocked
                                    10 locked               total += 13 
                                    10 locked               total = 13
                                    23 unlocked
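
  • The locked version of the table can be sketched with the standard library's threading module. This is an illustrative sketch only; the function name add_partial_sum and the split of the numbers between the two threads are assumptions chosen to match the tables.

```python
import threading

total = 0
lock = threading.Lock()


def add_partial_sum(numbers):
    """Sum privately, then update the shared total while holding the lock."""
    global total
    partial = sum(numbers)  # private work, no lock needed
    with lock:              # only one thread at a time may read-modify-write total
        total += partial


threads = [
    threading.Thread(target=add_partial_sum, args=([1, 3, 4, 2],)),  # partial sum 10
    threading.Thread(target=add_partial_sum, args=([0, 5, 6, 2],)),  # partial sum 13
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(total)  # 23
```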
    
  • As it turns out, because Python manages memory for you, every object in Python would be subject to these kinds of threading issues, since each object carries a reference count that must be updated safely:

        a = 1     # reference count = 1
        b = a     # reference count = 2
        del(b)    # reference count = 1
        del(a)    # reference count = 0
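
  • The counts can be observed in CPython with sys.getrefcount. This is a sketch under two assumptions: it relies on CPython's reference counting, and it uses a fresh object() rather than the integer 1, because small integers are shared by the interpreter and their counts are not meaningful. The reported number includes a temporary reference held by the call itself, so only the differences are compared.

```python
import sys

a = object()                    # a fresh object that nothing else refers to
baseline = sys.getrefcount(a)   # includes a temporary reference from the call itself

b = a                           # a second name for the same object
after_alias = sys.getrefcount(a)

del b                           # drop the second name again
after_del = sys.getrefcount(a)

# Differences relative to the baseline: one extra reference, then none.
print(after_alias - baseline, after_del - baseline)
```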
    
  • So the GIL, which lets only one thread run at a time, was invented to avoid this headache.

  • Certain parts of the Pandas dependency hierarchy are not subject to the GIL (simplified):
    • Pandas -> NumPy -> BLAS (Basic Linear Algebra Subprograms)
    • GIL -> no GIL -> hardware optimizations
  • So we can get around the GIL in C land, but what kind of optimizations does BLAS provide us with?
    • Parallel operations inside the CPU via vector registers
  • A vector register is like a regular register, but instead of holding one value it can hold multiple values.

    | 1 | 4 | 2 | 0 |
      +   +   +   +
    | 3 | 2 | 5 | 1 |
      =   =   =   =
    | 4 | 6 | 7 | 1 |
    
  • Vector registers are only so large though, so the DataFrame is broken up into chunks and the vector operations are performed on each chunk.
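
  • The chunking idea can be imitated in plain Python. This is only a conceptual sketch: the function name add_in_chunks and the register width of 4 are assumptions, and real vector instructions add a whole chunk in hardware rather than looping over it.

```python
def add_in_chunks(a, b, width=4):
    """Add two equal-length lists one fixed-width chunk at a time."""
    if len(a) != len(b):
        raise ValueError("columns must have the same length")
    result = []
    for start in range(0, len(a), width):
        chunk_a = a[start:start + width]  # what would be loaded into one register
        chunk_b = b[start:start + width]
        # In hardware the whole chunk is added by a single instruction;
        # this generator expression only imitates that step.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return result


print(add_in_chunks([1, 4, 2, 0], [3, 2, 5, 1]))  # [4, 6, 7, 1]
```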

Michael #4: Quickle

  • Fast. Benchmarks show it’s among the fastest serialization methods for Python.
  • Safe. Unlike pickle, deserializing a user provided message doesn’t allow for arbitrary code execution.
  • Flexible. Unlike msgpack or json, Quickle natively supports a wide range of Python builtin types.
  • Versioning. Quickle supports “schema evolution”. Messages can be sent between clients with different schemas without error.
  • Example
        >>> import quickle
        >>> data = quickle.dumps({"hello": "world"})
        >>> quickle.loads(data)
        {'hello': 'world'}
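
  • For contrast, here is a small sketch of why unpickling untrusted data is risky with the standard library's pickle. The class name Sneaky is hypothetical, and sorted is a harmless stand-in for whatever callable an attacker might choose.

```python
import pickle


class Sneaky:
    """On unpickling, pickle calls whatever callable __reduce__ names."""

    def __reduce__(self):
        # A harmless stand-in; any importable callable could be named here.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Sneaky())
result = pickle.loads(payload)  # runs sorted([3, 1, 2]) instead of rebuilding Sneaky
print(result)  # [1, 2, 3]
```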
    

Brian #5: what(), why(), where(), explain(), more() from friendly-traceback console

  • Do this:

        $ pip install friendly-traceback
        $ python -i
        >>> import friendly_traceback
        >>> friendly_traceback.start_console() 
        >>>
    
  • Now, after an exception happens, you can ask questions about it.

        >>> pass = 1
    
        Traceback (most recent call last):
          File "[HTML_REMOVED]", line 1
            pass = 1
                 ^
        SyntaxError: invalid syntax
        >>> what()
            SyntaxError: invalid syntax
    
            A `SyntaxError` occurs when Python cannot understand your code.
    
        >>> why()
            You were trying to assign a value to the Python keyword `pass`.
            This is not allowed.
    
        >>> where()
            Python could not understand the code in the file
            '[HTML_REMOVED]'
            beyond the location indicated by --> and ^.
    
            -->1: pass = 1
                       ^
    
  • Cool for teaching or learning.

Hannah #6: Bandit

  • Bandit is a static analysis security tool.
  • It’s like a linter but for security issues.

        pip install bandit
        bandit -r .
    
  • I prefer to run it in a git pre-commit hook:

        # .pre-commit-config.yaml
        repos:
        - repo: https://github.com/PyCQA/bandit
          rev: '1.7.6'
          hooks:
          - id: bandit
    
  • It finds issues like:

  • You can ignore certain issues just like any other linter:
        assert len(foo) == 1  # nosec
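
  • As one illustration, Bandit's B101 (assert_used) check reports assert statements, because they are stripped when Python runs with -O. The sketch below contrasts a flagged check with an explicit one; both function names are hypothetical.

```python
def first_item_with_assert(foo):
    # Bandit's B101 (assert_used) check reports the next line, because
    # assert statements are removed when Python runs with -O.
    assert len(foo) == 1
    return foo[0]


def first_item_checked(foo):
    # An explicit check keeps working under -O and needs no "# nosec".
    if len(foo) != 1:
        raise ValueError("expected exactly one item")
    return foo[0]
```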
    

Extras:

Brian:

  • Meetups this week 2/3 done.
    • NOAA Tuesday, Aberdeen this morning - “pytest Fixtures”
    • PDX West tomorrow - Michael presenting “Python Memory Deep Dive”
  • Updated my training page, testandcode.com/training
    • Feedback welcome.
    • I really like working directly with teams and now that trainings can be virtual, a couple half days is super easy to do.

Michael:

Joke:

Sent in via Michel Rogers-Vallée, Dan Bader, and Allan Mcelroy. :)

PEP 8 Song
