#220: What, why, and where of friendly errors in Python

Published Thu, Feb 11, 2021, recorded Thu, Feb 11, 2021

Sponsored by Datadog: pythonbytes.fm/datadog

Special guest: Hannah Stepanek

Play on YouTube

Watch the live stream replay

Michael #1: We Downloaded 10,000,000 Jupyter Notebooks From Github – This Is What We Learned

by Alena Guzharina from JetBrains
Used the hundreds of thousands of publicly accessible repos on GitHub to learn more about the current state of data science. I think it’s inspired by work showcased here on Talk Python.
Two years ago there were 1,230,000 Jupyter Notebooks published on GitHub. By October 2020 this number had grown roughly 8 times, and we were able to download 9,720,000 notebooks.
Despite the rapid growth in popularity of R and Julia in recent years, Python still remains the most commonly used language for writing code in Jupyter Notebooks by an enormous margin.
Python 2 went from 53% → 11% in the last two years.
Interesting graphs about package usage
Not all notebooks are storytelling with code: 50% of notebooks contain fewer than 4 Markdown cells and more than 66 code cells.
Although there are some outliers, like notebooks with more than 25,000 code lines, 95% of the notebooks contain fewer than 465 lines of code.

Brian #2: pytest-pythonpath

plugin for adding to the PYTHONPATH from the pytest.ini file before tests run
Mentioned briefly in episode 62 as a temporary stopgap until you set up a proper package install for your code. (cringing at my arrogance).
Lots of projects are NOT packages. For example, applications.
I’ve been working with more and more people to get started on testing and the first thing that often comes up is “My tests can’t see my code. Please fix.”
Example
- proj/src/stuff_you_want_to_test.py
- proj/tests/test_code.py
- You can’t import stuff_you_want_to_test.py from the proj/tests directory by default.
The more I look at the problem, the more I appreciate the simplicity of pytest-pythonpath
pytest-pythonpath does one thing I really care about:
- Add this to a pytest.ini file at the proj level:
```
    [pytest] 
    python_paths = src
```
That’s it. That’s all you have to do to fix the above problem.
Paths are relative to the directory that pytest.ini is in, which should be a parent or grandparent of the tests directory.
I really can’t think of a simpler way for people to get around this problem.
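For intuition, the effect of `python_paths = src` is roughly that the listed directory ends up on `sys.path` before the test modules are imported. The sketch below imitates that by hand in a throwaway directory. The layout mirrors the example above, the function `answer` is a made-up stand-in for real code, and the manual `sys.path.insert` is an illustration of the idea rather than the plugin's actual implementation.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway project that mirrors the layout above:
#   proj/src/stuff_you_want_to_test.py
#   proj/tests/            (where test_code.py would live)
proj = Path(tempfile.mkdtemp()) / "proj"
(proj / "src").mkdir(parents=True)
(proj / "tests").mkdir()
(proj / "src" / "stuff_you_want_to_test.py").write_text(
    "def answer():\n    return 42\n"  # hypothetical function under test
)

# Roughly what a `python_paths = src` entry achieves: the path is taken
# relative to the directory holding pytest.ini (here, proj) and made importable.
sys.path.insert(0, str(proj / "src"))
importlib.invalidate_caches()

# A test in proj/tests can now use a plain import.
import stuff_you_want_to_test

print(stuff_you_want_to_test.answer())
```

Without the `sys.path` line, the import fails with `ModuleNotFoundError`, which is the "my tests can't see my code" symptom described above.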

Hannah #3: Thinking in Pandas

Pandas dependency hierarchy (simplified):
- Pandas -> NumPy -> BLAS (Basic Linear Algebra Subprograms)
Languages:
- Python -> C -> Assembly

Example: adding two columns, row by row:

    df["C"] = df["A"] + df["B"]

    A = [ 1
          4
          2
          0 ]
    B = [ 3
          2
          5
          1 ]
    C = [ 1 + 3
          4 + 2
          2 + 5
          0 + 1 ]

Pandas tries to get the best performance by running operations in parallel.

You might think we could speed this up by giving each row's addition its own thread:

    Thread 1: 1 + 3
    Thread 2: 4 + 2
    Thread 3: 2 + 5
    Thread 4: 0 + 1

However, the GIL (Global Interpreter Lock) prevents us from achieving the performance improvement we are hoping for.
Below is an example of a common threading problem and how a lock solves that problem.

    Thread 1                  total                    Thread 2
     1 + 3 + 4 + 2              0                       0 + 5
     10                         0                       + 6 + 2
     total += 10                0                       13
     total = 10                 0                       total += 13
                                10                      total = 13
                                13

    Thread 1                  total                    Thread 2
     1 + 3 + 4 + 2              0 unlocked              0 + 5 
     10                         0 unlocked              + 6 + 2           
     total += 10                0 locked                13
     total = 10                 0 locked
                                10 unlocked
                                10 locked               total += 13 
                                10 locked               total = 13
                                23 unlocked
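The second diagram can be sketched with the standard `threading` module. This is a minimal illustration, not code from the book: the class name `SharedTotal` and its method are made up for the example. Each thread computes its partial sum privately and only takes the lock for the update of the shared total.

```python
import threading


class SharedTotal:
    """A running total that several threads may update safely."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, values):
        partial = sum(values)   # private to the calling thread, no lock needed
        with self._lock:        # "locked": only one thread may update at a time
            self.value += partial
        # leaving the `with` block releases the lock ("unlocked")


shared = SharedTotal()
t1 = threading.Thread(target=shared.add, args=([1, 3, 4, 2],))  # partial sum 10
t2 = threading.Thread(target=shared.add, args=([0, 5, 6, 2],))  # partial sum 13
t1.start()
t2.start()
t1.join()
t2.join()

print(shared.value)  # 23, regardless of which thread updates first
```

Holding the lock only around the update keeps the critical section short, so the threads spend as little time as possible waiting on each other.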

As it turns out, because Python manages memory for you, every object in Python would be subject to these kinds of threading issues:

    a = 1     # reference count = 1
    b = a     # reference count = 2
    del b     # reference count = 1
    del a     # reference count = 0
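The counts in the listing above are simplified. To watch the counter move on a real interpreter, one option is `sys.getrefcount`, which is specific to CPython. The sketch below uses a made-up class `Row` as a stand-in for any object and compares counts relative to a baseline, because the absolute number includes temporary references held by the call itself.

```python
import sys


class Row:
    """Hypothetical stand-in for any Python object."""


a = Row()
base = sys.getrefcount(a)        # baseline, includes temporary references

b = a                            # a second name now refers to the same object
after_bind = sys.getrefcount(a)

del b                            # dropping the name releases that reference
after_del = sys.getrefcount(a)

# Differences from the baseline: one extra reference after `b = a`,
# back to the baseline after `del b`.
print(after_bind - base, after_del - base)
```

Every one of those increments and decrements is a write to shared state, which is why unsynchronized threads touching the same object would be a problem.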

So the GIL, which lets only one thread run at a time, was invented to avoid this headache.
Certain parts of the Pandas dependency hierarchy are not subject to the GIL (simplified):
- Pandas -> NumPy -> BLAS (Basic Linear Algebra Subprograms)
- GIL -> no GIL -> hardware optimizations
So we can get around the GIL in C land, but what kind of optimizations does BLAS provide us with?
- Parallel operations inside the CPU via Vector registers

A vector register is like a regular register but instead of holding one value it can hold multiple values.

    | 1 | 4 | 2 | 0 |
      +   +   +   +
    | 3 | 2 | 5 | 1 |
      =   =   =   =
    | 4 | 6 | 7 | 1 |

Vector registers are only so large, though, so the DataFrame is broken up into chunks and the vector operations are performed on each chunk.
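The chunking idea can be modeled in plain Python. This is a toy model only: the register width `LANES = 4` is an assumption chosen to match the diagram, and the inner loop stands in for what would be a single vector instruction on real hardware. It is not how NumPy or BLAS are implemented.

```python
from typing import List

LANES = 4  # assumed register width for illustration; real widths depend on the CPU


def vector_add(a: List[int], b: List[int]) -> List[int]:
    """Add two equal-length columns one LANES-sized chunk at a time."""
    if len(a) != len(b):
        raise ValueError("columns must have the same length")
    out: List[int] = []
    for start in range(0, len(a), LANES):
        chunk_a = a[start:start + LANES]  # "load" up to LANES values
        chunk_b = b[start:start + LANES]
        # On real hardware this whole chunk would be one vector add;
        # here an ordinary Python loop stands in for it.
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return out


# Six rows: one full chunk of four, then a partial chunk of two.
print(vector_add([1, 4, 2, 0, 7, 7], [3, 2, 5, 1, 1, 2]))  # [4, 6, 7, 1, 8, 9]
```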

Michael #4: Quickle

Fast. Benchmarks show it’s among the fastest serialization methods for Python.
Safe. Unlike pickle, deserializing a user provided message doesn’t allow for arbitrary code execution.
Flexible. Unlike msgpack or json, Quickle natively supports a wide range of Python builtin types.
Versioning. Quickle supports “schema evolution”. Messages can be sent between clients with different schemas without error.

Example

    >>> import quickle
    >>> data = quickle.dumps({"hello": "world"})
    >>> quickle.loads(data)
    {'hello': 'world'}
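To see what the "Safe" point is guarding against, here is a harmless sketch of the pickle behavior it refers to, using only the standard library. The class `Payload` is hypothetical. A pickled message can name a callable to run at load time, so `pickle.loads` on untrusted bytes can execute code chosen by the sender.

```python
import pickle


class Payload:
    """Hypothetical attacker-controlled object."""

    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # A hostile message could name any importable callable here.
        return (eval, ("6 * 7",))


message = pickle.dumps(Payload())  # bytes that could arrive over the network
result = pickle.loads(message)     # loading calls eval(...) instead of rebuilding Payload

print(result)  # 42: the expression was evaluated during loading
```

The expression here is benign, but the same mechanism would run anything the sender picked, which is why formats that only reconstruct known data types are safer for user-provided input.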

Brian #5: what(), why(), where(), explain(), more() from friendly-traceback console

Do this:

    $ pip install friendly-traceback
    $ python -i
    >>> import friendly_traceback
    >>> friendly_traceback.start_console() 
    >>>

Now, after an exception happens, you can ask questions about it.

    >>> pass = 1

    Traceback (most recent call last):
      File "[HTML_REMOVED]", line 1
        pass = 1
             ^
    SyntaxError: invalid syntax
    >>> what()
        SyntaxError: invalid syntax

        A `SyntaxError` occurs when Python cannot understand your code.

    >>> why()
        You were trying to assign a value to the Python keyword `pass`.
        This is not allowed.

    >>> where()
        Python could not understand the code in the file
        '[HTML_REMOVED]'
        beyond the location indicated by --> and ^.

        -->1: pass = 1
                   ^

Cool for teaching or learning.
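The facts behind those answers can be checked by hand with the standard library. The sketch below is not how friendly-traceback works internally; it only confirms, under that caveat, that `pass` is a reserved keyword and that Python refuses to compile the assignment. The filename `<example>` is a placeholder.

```python
import keyword

source = "pass = 1"
target = source.split("=")[0].strip()     # the name on the left of the assignment

is_keyword = keyword.iskeyword(target)    # why(): `pass` is a reserved keyword

try:
    compile(source, "<example>", "exec")  # what(): Python cannot parse this
    error_line = None
except SyntaxError as exc:
    error_line = exc.lineno               # where(): line number of the problem

print(is_keyword, error_line)  # True 1
```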

Hannah #6: Bandit

Bandit is a static analysis security tool.
It’s like a linter but for security issues.
```
    pip install bandit
    bandit -r .
```

I prefer to run it in a git pre-commit hook:

    # .pre-commit-config.yaml
    repos:
    - repo: https://github.com/PyCQA/bandit
      rev: '1.7.6'
      hooks:
      - id: bandit

It finds issues like:
- flask_debug_true
- request_with_no_cert_validation
You can ignore certain issues just like any other linter:
```
    assert len(foo) == 1  # nosec
```

Extras:

Brian:

Meetups this week 2/3 done.
- NOAA Tuesday, Aberdeen this morning - “pytest Fixtures”
- PDX West tomorrow - Michael Presenting “Python Memory Deep Dive”
Updated my training page, testandcode.com/training
- Feedback welcome.
- I really like working directly with teams and now that trainings can be virtual, a couple half days is super easy to do.

Michael:

Joke:

Sent in via Michel Rogers-Vallée, Dan Bader, and Allan Mcelroy. :)

PEP 8 Song

By Leon Sandoy and team at Python Discord

Episode Transcript

00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to

00:04 your earbuds. This is episode 220, recorded February 10th, 2021. I'm Michael Kennedy.

00:10 I'm Brian Okken.

00:11 And we have a special guest, Hannah. Welcome.

00:12 Hello.

00:13 Hannah Stepanek, welcome to the show. It is so great to have you here.

00:16 Thank you. I'm happy to be here.

00:18 Yeah, it's good to have you. It's so cool. The internet is a global place. We can have

00:23 people from all over. So we've decided to make it an all Portland show this time.

00:27 We could do this in person, actually. Well, not really, because we can't go anywhere. But

00:30 theoretically, geographically, anyway. Yeah, so all three of us are from Portland, Oregon. Very nice.

00:35 Before we jump into the main topics, a few quick things. One, this episode is brought to you by

00:41 Datadog. Check them out at pythonbytes.fm/datadog. And Hannah, do you just want to give people a quick

00:45 background on yourself?

00:47 Yeah, so I'm Hannah. I have written a book, which is weird to say, about pandas. But I also just go

00:56 around, like, give talks at various conferences, like on Python. So yeah, like I gave re-architecting

01:03 legacy code base recently.

01:05 That sounds interesting and challenging.

01:06 Yeah.

01:07 What was the legacy language? Was it Python or something?

01:10 It was Python. It was like a Flask web application. And then also the front end of it was Vue, like

01:18 Vue.js.

01:19 Oh, yeah.

01:20 So yeah, that's been a fun project. That was through work as developers. Like, you're pretty much always

01:26 working with some form of legacy code. Just depends on how legacy it really is.

01:30 Well, what could be cutting edge in one person's viewpoint might be super legacy in another, right?

01:36 Like, it's Python 3.5. You wouldn't believe it.

01:38 Right.

01:39 Yeah. Very cool. Well, it's great to have you here. I think maybe we'll start off with our

01:46 first topic, which is sort of along the lines of the data science world, some tie-ins to your book.

01:51 And of course, whenever you go to JetBrains, you've got to run your CLI to accept the cookies,

01:56 which is fantastic. And so this topic, this first topic I want to cover is from JetBrains. And it's

02:02 entitled, we downloaded 10 million Jupyter notebooks. I almost said 10,000. 10 million Jupyter notebooks

02:08 from GitHub. Here's what we learned. So this is an article or analysis done by Alena Guzharina.

02:14 And yeah, pretty neat. So they went through and downloaded a whole bunch of these notebooks and

02:20 just analyzed them. And there's many, many of them are publicly accessible. And a couple of years ago,

02:25 there were 1.2 million Jupyter notebooks that were public. As of last October, it was eight times as

02:33 many. 9.7 million notebooks available on GitHub. That's crazy, right?

02:37 Wow.

02:38 Yeah. So this is a bunch of really nice pictures and interactive graphs and stuff. So I encourage

02:43 people to go check out the webpage. So for example, one of the questions was, well, what language do you

02:49 think is the most popular for data science, just by judging on the main language of the notebook?

02:54 Hannah, you want to take a guess?

02:55 Oh yeah. Python, for sure. Without a doubt.

02:58 That's for sure. The second one, I'm pretty sure no one who's not seen this, there's no way they're

03:05 going to guess. It's NaN. We have no idea. Like we look, we can't tell what language this is in there.

03:13 But then the other contenders are R and Julia. And often people say, oh yeah, well, Julia,

03:18 maybe I should go to Julia from Python. Well, maybe, but that's not where the trends are. Like

03:22 there's 60,000 versus 9 million, you know, as the ratio, I don't know what that number is,

03:26 but it's a percent of a percent type of thing. Wow.

03:29 They also talk about the Python 2 versus 3 growth or difference. So in 2018, it was about 50%

03:37 was Python 2. And in 2020, Python 2 is down to 11%. And I was thinking about this 11%.

03:43 Like, why do you guys think people, there's still 11% there hanging around?

03:47 I mean, I would guess, speaking of legacy applications, probably it's just hasn't been

03:53 touched, but yeah. Yeah. Those are very likely the ones that were like the original 2016, 17 ones that

03:59 were not quite there. They're still public, right? GitHub doesn't get rid of them. The other one is I

04:04 was thinking, you know, a lot of people do work on Mac or maybe even on some Linux machines that just

04:10 came at the time with Python two. So they're just like, well, I'm not going to change anything. It just,

04:13 I just need to view this thing. I don't have Python problem solved, right? They didn't know

04:17 that there's more, more than one Python. There's a good breakdown of the different versions. Another

04:21 thing that's interesting is looking at the different languages, not language, different libraries used

04:27 during this. So like NumPy is by far the most likely used. And then a tie is pandas and matplotlib,

04:32 and then scikit-learn, and then OS actually for traversing stuff. And then there's a huge long tail.

04:37 And they also talk about combinations like pandas and NumPy are common, and then pandas,

04:41 and then like seaborn, scikit-learn, pandas, NumPy, matplotlib, and so on as a combo. And so

04:46 that's really interesting, like what sets of tools data scientists are using. Yeah. And then another

04:51 one is they looked at deep learning libraries and PyTorch seems to be crushing it in terms of growth,

04:56 but not necessarily in terms of popularity. So it grew 1.3 times or 130%, whereas TensorFlow is more

05:03 popular, but only grew 30% and so on. So there's a lot of these types of statistics in there. I think

05:07 people will find interesting if they want to dive more into this ecosystem. You know, it's one thing

05:12 to have survey and go fill out the survey, like ask people, what do you use? You know, what platform do

05:17 you run on? Vue.js or Linux? Like, okay, well, that's not really a reasonable question, but I guess

05:21 Vue.js, you know, like, but if you just go and look at what they're actually doing on places like

05:26 GitHub, I think you can get a lot of insight. Yeah, for sure. Yeah, I know I use, like I'll go to GitHub

05:31 pretty frequently, like at work when I'm, you know, just like browsing, like, I wonder how you do this

05:36 thing or like, what's the most common way to do this? Or yeah, absolutely. There's just look up,

05:40 like, what's the most popular. So it's a pretty good sign if a lot of people are using it.

05:44 It is. One thing I should probably make more better use of is I know they started adding dependencies,

05:49 like, oh, if you go to Flask, it'll show you Flask is used in these other GitHub repos and stuff.

05:54 Like you could find interesting little connections. I think, oh, this other project uses this cool

05:58 library. I know nothing about, but if they're using it, it's probably good. Yeah, for sure.

06:01 Yeah. I love the dependency feature of looking who's using it. It's neat. Yeah, absolutely. So,

06:07 Brian, you going to cover something on testing this time? Yeah. Will we make you?

06:11 I wanted to bring up something we brought up before. So there's a project called pytest-pythonpath,

06:19 and it's just a little tiny plugin for pytest. And we did cover it briefly way back in episode 62.

06:27 But at the time, I brought it up as a way to

06:34 to just shim, like be able to have your test code, see your source code, but as just like a shortcut,

06:42 like a stop gap until you actually put together like proper packaging for your source code. But the

06:47 more I talked to real life people were testing all sorts of software and hardware, even there's,

06:55 there that that's a simplistic view of the world. So thinking of everybody is working on,

07:00 on packages is, is not real. There's applications for instance, that, that they're never going to set

07:07 up, pull their code together as a package. And that's, that's, that's legitimate. So if you have an

07:13 application and your, your source code is in your source directory and your test code is in your test

07:18 directory, it's just, your tests are just not going to be able to see your source code right off

07:23 the bat. So what's more, tricky is depending on how you run it, they will, or they won't.

07:30 Yeah. Right. Right. If you say run it with PyCharm and you open up the whole thing and it can like put

07:34 together the paths, you're all good. But if you then just go into the directory and type pytest, well,

07:38 maybe not.

07:38 It doesn't work. And it just confuses a lot of people. And so more and more, I'm recommending people

07:44 use this, this little plugin and really, the big benefit is it gives you there's,

07:52 there's, there's a, it does a few things, but the big biggie is just, you can add a Python path,

08:00 setting within your pytest.ini file, and you stick your .ini file at the top of your project.

08:05 And then you just give it a relative path to where your source code is like source or SRC

08:11 or something else. And then pytest from then on will be able to see your source code.

08:17 It's a really simple solution. It's just, I, I, that's way better than what I do.

08:23 I don't think it's a stop gap. I think it's awesome. So yeah, I totally agree. What I do a lot of times

08:27 is certain parts of my code. I'm like, this is going to get imported. For me, the real tricky thing is

08:32 Alembic, the database migration tool, and the tests and the web app. And usually I can get the

08:39 tests and the web app to work just fine running them directly. But for some reason, Alembic always

08:43 seems to get weird, like working directories that don't line up in the same way. So it can't import

08:47 stuff. So a lot of times I'll put at the top of some file, you know, go to the Python path and add,

08:54 you know, get the directory name from dunder file and go to the parent, add that to the Python path.

09:00 And now it's going to work from then on basically. And, this seems like a nicer one, although it doesn't

09:05 help me with Alembic, but still. But it might, you might be able to add the Alembic path right to it.

09:11 So yeah, yeah, for sure. Very cool. So it says, yeah, go ahead, Hannah.

09:14 Oh, I was just going to say, yeah, like this is something I like pretty much every time I set up a new project.

09:19 Like I always have to screw with the Python path. I always like run it initially. And then it's like, Oh,

09:25 can't find blah, blah, blah. And I'm like, Oh, here we go again.

09:28 But I usually always run my projects from Docker though. So I just, you know, hard code that stuff,

09:35 like just once you get it set up. That's cool. Nice. I dream of days when I can use Docker again,

09:41 I have an M1 Mac and it's in super early, early beta stages. Yeah. It's okay. I don't,

09:47 I don't mind too much because I don't use it that much, but still cool. Brian,

09:51 it says something about dot PTH. I'm guessing path files. What do you know anything about this?

09:55 I have no idea what those are. Oh, dot PTH files. So there's yeah, there's there.

10:01 There are a way to, I don't know a lot. I don't know the detail, the real big details, but it's,

10:07 it's a way to have a you can have a list of different paths when it, within that file. And if you import it

10:15 or don't import it, if you include it in your path, then Python, I think includes all of the

10:22 contents into anyway, I'm actually, I'm blowing smoke. I don't know the details. Okay. Sorry.

10:26 Yeah. But apparently you can have a little more control with .pth files, whatever those are.

10:30 Yeah. I don't know much about that either. Yeah. Unfortunately. I mean, I've been using

10:34 os.path. So what do I know? All right. Speaking of what do I know? I could definitely learn more

10:40 about pandas and that's one of your items here, huh? Hannah? Yeah. So I thought maybe I just give

10:48 like a little snippet of kind of like some of the stuff I talk about in the book. Fantastic. So yeah,

10:55 here we go. So if we're looking at pandas in terms of like the dependency hierarchy, well,

11:03 and I guess I should start at the beginning. So what is pandas if you're not familiar with it?

11:08 It's a data analysis library for Python. So it's used for doing big data operations. And so like,

11:16 if we look at the dependency hierarchy of pandas, it kind of goes like pandas, which is dependent on

11:21 NumPy, which deep down is dependent on this thing called BLAS, which is Basic Linear Algebra Subprograms.

11:28 Right. And wasn't there something with BLAS and a Windows update in a certain version,

11:33 I think recently? I can't remember. I feel like there was some update that like made that thing

11:37 that wasn't working. Yeah. Usually a big challenge around numpy and versioning and stuff to make it

11:42 work in the short term. Yeah. Usually the BLAS library is built into your OS already. And it just

11:49 points at that. But if you're using something like Anaconda, I think by default, like it installs

11:55 Intel MKL and uses that. But yeah, if you're using like Linux or just like out of the box,

12:01 whatever's on Windows, which is what it is, if you like pip install it, then yeah, there could

12:06 certainly be issues with like dependencies mismatches. Yeah. So, and I've like greatly simplified this,

12:15 but in terms of kind of like the languages and walking down that dependency hierarchy,

12:22 you start out in Python with pandas and then NumPy is partially Python and partially C and then BLAS is

12:31 pretty much always written in assembly. And if you don't know what assembly is, it's basically like a

12:35 very, very, very, like probably the lowest level language you can program in. And it's essentially

12:40 like CPU instructions for your processor. And so I've taken this just like basic example here and I'm

12:48 going to kind of like roll with it. So if we're doing just like a basic addition in pandas, say like

12:56 we have column A and we want to add that with column B and like store it back into column C.

13:01 Like a traditional linear algebra vector addition type thing.

13:05 Traditional vector math. So pandas, like if you, if you look at these operations, each,

13:13 each of these like additions on a per row basis is independent, meaning like you could conceivably run

13:20 like each of those additions for each row, like in parallel. Like there's no reason why you have to go

13:25 like row by row. and that's essentially like what kind of like big data analysis libraries are

13:32 like at their core is they, they like understand this conceptually and try to parallelize things as

13:38 much as possible. and so that's kind of like the first like fundamental understanding that you have

13:42 to have, like when working with pandas is like, you should be doing things in parallel as much as you

13:47 can. which means understanding the API and understanding like which functions in the API

13:51 will let you do things in parallel. so like if we're just not using pandas at all, say like

13:59 we're just inventing our own sort of like technique for this, like you might think, well, like each of

14:04 these rows could be broken up like into a thread, right? So like we could say like thread one is going to

14:09 run like the first row addition. And then like thread two is going to run the second row, et cetera.

14:14 But you might find that we'll run into issues with this in terms of the GIL. So the GIL,

14:21 otherwise known as the global interpreter lock in Python, prevents us from really like

14:27 running a multi-threaded operation in parallel. Basically, the rule is Python

14:35 can run one Python opcode at a time and that's it. All right. It doesn't matter if you've

14:41 got, you know, 16 cores, it's one at a time. Yeah. Yeah. And this like is really terrible for,

14:50 yeah. For, for like trying to do things in parallel. Right. So like that, that kind of

14:56 use case is out, like pandas and NumPy and all that stuff is not going to be able to use

15:01 multi-threading. and so, and like, I just want to point out like Python, like at its core has

15:10 this like fundamental problem, which is why they went with the GIL. So like Python manages memory for you.

15:17 and it, how it does that is it keeps track of references to know when to, free up memory.

15:26 so like when memory can be like completely destroyed and somebody else can use it essentially.

15:33 and like that's something you've got to do stuff like Brian sometimes probably has to do with C and

15:37 like free and all those things. Right. Yeah, exactly. Yeah. Yeah. So like in C, you have to do this

15:43 yourself with like malloc and free and all that stuff. But with Python, it does it for you,

15:49 but that comes at a cost, which means like every single object in Python has this little like counter,

15:54 which is like a reference counter. and so basically like way back in the day, like when

16:00 threading first became a thing, like in order to kind of like avoid this threading problem,

16:07 they came up with the GIL, which basically says you can only run one thread at a time or like

16:13 one opcode at a time as, as you said.

16:15 And attempts have been made to remove it. Like Larry Hastings has been working on something

16:20 called the Gilectomy, the removal of the GIL, for a while. And the main problem is, if you take

16:25 it away, the way it works now is you have to do lock on all memory access, all variable access,

16:30 which actually has a bigger hit than a lot of the benefits you would get, at least in the single

16:35 threaded case. And I know Guido said like, we really don't want to make changes to this,

16:39 if it's going to mean slower, single threaded Python, they'll probably not for a while.

16:43 Yeah. Yeah. Yeah. And that, that is a big problem. So like, I mean, if generally what people use,

16:49 like instead of threads in Python is they use like multiprocessing and they spin up multiple Python

16:55 processes. Right. And like that truly kind of like achieves the parallelism. but anyways,

17:01 I digress. So we can't use the GIL, but what's interesting to note is when you're

17:10 running NumPy at its very low level in C, like when you enter and look at the C files,

17:16 it actually is not subject to the GIL anymore because you're in C. and so you can potentially

17:21 run, you know, multi-threaded things in C, and call it from Python. so, but beyond that,

17:31 if we look at BLAS, BLAS has built-in parallelization, like hardware parallelization.

17:38 and how it does that is through vector registers. so if you're not familiar with like the

17:46 architecture of CPUs and stuff, like at its core, you basically, only have like, only can

17:53 have a certain small set, maybe like three or four values in your CPU at any one time that you're running

18:00 like adds and multiplies on. And like how that works is you load those values like into the CPU from

18:07 memory. And that load can be quite time consuming. It's really just based on like how far away your memory is from

18:14 from your CPU at the end of the day, like physically on your board. Right. Right. Is it in the cache?

18:18 Is it in the RAM? Yes. Yeah. And that's why we have caches. So like caches are like memory that's closer

18:24 to your CPU. consequently it's also smaller. but that's, that's how you can kind of, you might hear

18:31 like people say like, oh, like so-and-so wrote this really performant program and it like utilizes like the

18:37 size of the cache or whatever. So like basically like if you can load all of that data, like into your cache and

18:43 run the operations on it without ever like having to go back out to memory, like you can make a really

18:48 fast program. Yeah. Yeah. It could be like a hundred times faster than regular memory. Yeah. Yeah. And so

18:53 essentially like that's what BLAS is trying to do like underneath NumPy: they're trying

19:00 to take this giant set of data and break it into chunks and load those chunks into your cache and

19:08 operate on those chunks. and then dump them back out to memory and load the next chunk.

19:13 Yeah, very cool. Yeah. Thanks for pointing that out. Like I didn't realize that BLAS leveraged some of

19:18 the OS native stuff, nor that it had like special CPU instruction type optimizations. That's pretty cool.

19:24 Yeah. Yeah. so like it has, like on top of the registers, it also has these things called

19:31 like vector registers, which actually can hold like multiple values at a time in your CPU. so like,

19:38 we could take this like simple example of the addition, and while we can't

19:43 run those like row per row calculations in parallel with threads, we can with vector registers.

19:51 Okay. and the limitation there is that the memory has to be, sequential when you load it in.

19:57 this is definitely at a level lower than I'm used to working at. How about you?

20:03 But yeah, so, anyways, this is just like kind of the stuff that I talk about in my book.

20:08 it's not necessarily about like how to use pandas. but it's, it's about like kind of like

20:14 what's going on underneath pandas. And then like, once you kind of like build that foundation of

20:18 understanding, like you can understand like better how pandas is working and like how to use it correctly

20:24 and what all the various functions are doing. Fantastic. Yeah. So people can check out your book.

20:28 Got a link to it in the show notes. So, very nice. It's offering me the European,

20:33 the Euro price, which is fine. I don't mind. So yeah. So like, I mean, it's on Amazon too.

20:38 It's on a lot of different platforms, but I figured I'd just point directly to the publishers.

20:43 Yeah, no, that's perfect. Perfect. quick comment. Roy Larson says, NumPy and Intel MKL cause issues. Sometimes you could learn on windows. If something else in the system

20:54 uses Intel MKL. Okay. Yeah. Interesting. I have no experience with that, but I can believe it. Intel

21:00 has a lot of interesting stuff. They even have a special Python compiled version,

21:04 I think for Intel if you use potentially, I'm not sure they have some high performance version.

21:08 Yeah. Yeah. Yeah, they do. Yeah.

21:10 Nice. Also in Portland, you can keep it in Portland. There we go.

21:15 Now, before we move on to the next item, let me tell you about our sponsor today.

21:19 Thank you to Datadog. So they're sponsoring the show. And if you're having trouble visualizing latency,

21:25 CPU, memory bottlenecks, things like that in your app, and you don't know why you don't know where it's

21:30 coming from or how to solve it, you can use Datadog to correlate logs and traces at the level of

21:35 individual requests, allowing you to quickly troubleshoot your Python app. Plus they have

21:38 a continuous profiler that allows you to find the most resource consuming parts of your production code

21:44 all the time at any scale with minimal overhead. So you just point it at your production

21:47 server, run it, which is not normally something you want to do with diagnostic tools, but you can with

21:51 their continuous profiler, which is pretty awesome. You'll be the hero that got that app back on track at

21:56 your company, get started with a free trial at pythonbytes.fm/datadog, or just click the link in

22:01 your podcast player show notes. Now, I'm sure you all have heard that working with pickle has all sorts

22:08 of issues, right? Pickle is a way to say, take my Python thing, make a binary version of bits that

22:13 looks like that Python thing so I can go do stuff with it, right? That's generally got issues, not the

22:19 least of which actually are around the security stuff. So like you unpickle something to deserialize it,

22:25 that is actually potentially running arbitrary code. So people could send you a pickle virus.

22:30 I don't know what that is like a bad, a rotten pickle or whatever. That wouldn't be good.
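
To make the unpickling risk concrete, here is a minimal sketch using only the standard library. The `Rotten` class is a made-up name for illustration, and the payload calls the harmless built-in `sorted`; a hostile payload could name some other importable callable instead.

```python
import pickle


class Rotten:
    """Hypothetical 'rotten pickle': tells pickle what to call when it is loaded."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it during loads().
        # A harmless built-in is used here; an attacker would pick something worse.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Rotten())

# Loading does not rebuild a Rotten object; it runs sorted([3, 1, 2]) instead.
result = pickle.loads(payload)
print(type(result).__name__, result)
```

The point is that the bytes, not the reader, decide which function runs at load time, which is why untrusted pickles should never be loaded.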

22:34 So there's a library I came across that solves a lot of the pickle problems.

22:39 It's supposed to be faster than pickle and it was cleverly named quickle.

22:43 Either of you heard of this thing?

22:46 No.

22:47 Yeah, it's cool, right? So here's the deal. It's a fast serialization format for a subset of Python types.

22:54 So you can't pickle everything, but you can pickle like way more say than JSON. And the

22:59 reasons they give to use it are it's fast. If you check out the benchmarks, I'll pull those up in a

23:03 second. It's one of the fastest ways to serialize things in Python. It's safe, which is important.

23:09 And unlike pickle, deserializing a user-provided message does not allow arbitrary code execution.

23:14 That seems like the minimum bar. Like, oh, I got stuff off the internet. Let's try to execute that.

23:19 What's that going to do? Oh, look, it's reading all my files. That's nice.

23:22 All right.

23:23 It's also flexible because it supports more types. And we'll also learn about a bunch of other

23:30 libraries while we're at it here, which is kind of cool. A bunch of things I heard of like

23:34 MsgPack or, well, JSON, you may have heard of that. And the other main problem you get with some

23:39 of these binary formats is you can end up in a situation where you can't read something

23:44 if you make a change to your code. Like, so imagine I've got a user object and I've pickled them

23:48 and put them into a Redis cache. We upgrade our web app, which adds a new field to the user object.

23:53 That stuff is still in cache. After we restart, we try to read it. Oh, that stuff isn't there anymore.

23:58 You can't, you know, use the cache anymore. Everything's broken, et cetera, et cetera. So quickle has a concept of

24:03 schema evolution, having different versions of like history. So there's ways that older messages can be

24:09 read without errors, which is pretty cool. Yeah. That's nice. Yeah. Neat, huh? Yeah. I'll pull up the benchmarks.
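
The stale-cache problem described above can be sketched without any third-party library. This is not quickle's own schema-evolution mechanism, just a plain-Python illustration of the idea using JSON and a dataclass; the `User` class, its fields, and the cached payload are all hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class User:
    name: str
    email: str
    # Field added in the "new" release; the default lets old messages still load.
    is_admin: bool = False


def load_user(raw: str) -> User:
    """Decode a cached user, tolerating messages written by older or newer code."""
    data = json.loads(raw)
    known = {f.name for f in fields(User)}
    # Ignore keys this version does not know; missing keys fall back to defaults.
    return User(**{k: v for k, v in data.items() if k in known})


# Written to the cache before the upgrade, so it has no "is_admin" key.
old_cached = '{"name": "sam", "email": "sam@example.com"}'
user = load_user(old_cached)
print(user)
```

The design choice is that new fields always carry a default, so a record written before the upgrade still decodes after the restart.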

24:14 There's actually a pretty cool little site here. It shows you some examples on how to use it. I mean,

24:18 it's incredibly simple. It's like, dump this as a string, read this, you know, deserialize this.

24:22 It's real simple. So, but there's quite interesting analysis, live analysis where you can click around

24:29 and you can actually look at like load speed versus reads like serialized versus deserialized speed,

24:35 how much memory is used and things like that. And it compares against pickle tuples,

24:39 protobuf, pickle itself, orjson, msgpack, quickle, and quickle structs.

24:44 There's a lot of things. I mean, I knew about two of those, I think. That's cool.

24:48 But these are all different ways. And you can see, like in all these pictures, generally,

24:52 at least the top one where it's time shorter is better. Right? So you can see if you go with

24:57 there, like quickle structs, it's, quick rule of thumb, maybe four or five times faster than pickle,

25:02 which I presume is way faster than JSON, for example.

25:04 You know, you'll also see the memory size, which actually varies by about 50% across the

25:09 different things. Also speed of loading up a whole bunch of different objects and so on. So yeah,

25:14 you can come check out this analysis here. Let's see all the different libraries that we had. Yeah,

25:19 I guess we read them all off basically there, but yeah, there's a bunch of different ways which are,

25:23 you know, not pickle itself to do this kind of binary serialization, which is pretty interesting.

25:28 I think it does. Protobuf, that's pretty cool. Actually, I want to try this out. It looks neat.

25:33 Yeah. Yeah, it looks really right. And one of the things I was just looking at the source code,

25:37 I love that they use pytest to test this. Of course, you should use pytest. But the, I can't believe

25:45 I'm saying this, but this would be the perfect package to test with a Gherkin syntax. Don't you think?

25:50 Because it's a pickle thing. Oh my gosh. You've got to use the Gherkin syntax.

25:54 So yeah, you definitely should. And Roy threw out another one, like the UQ Foundation's

26:02 dill package deals with many of the same issues, but it's binary and has all the same

26:07 sort of versioning challenges you might run into. Well, dill, the dill package. That's funny.

26:12 Yeah, pretty good. Pretty good. All right. So anyway, like, you know, I'm,

26:16 I'm kind of a fan of JSON these days. I've had enough XML with custom namespaces in my life that

26:22 I really don't want to go down that path and XSLT and all that. But, you know, I've really shied away

26:27 from these binary formats for a lot of these reasons here. But, you know, this might make me interested.

26:33 If I was going to say throw something into a cache, the whole point is put it in the cache,

26:36 get it back, read it fast. This might be decent. Yeah. Yeah. It definitely seems to address a lot of the

26:42 concerns I have with pickle for sure. Yeah. And I don't, did I talk about the types

26:46 somewhere in here? We have time. Yeah. Here's, there's quite a list of types. You know, one's

26:50 really nice: datetime. I can't do that with JSON. Why in the world doesn't JSON support

26:54 some sort of time information? Oh, well. But you've got most of the fundamental types that you might run

26:59 into. All right. So, quickle, give it a quick look. All right, Brian, what you got here?
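
On the point about JSON having no date type, here is a small standard-library sketch. It shows the default encoder rejecting a `datetime` and one common workaround, ISO 8601 strings; the `event` dictionary and the `encode_extra` helper are made up for illustration.

```python
import json
from datetime import datetime

event = {"name": "deploy", "when": datetime(2021, 2, 11, 9, 30)}

try:
    json.dumps(event)
    failed = False
except TypeError:
    # The standard encoder has no rule for datetime objects.
    failed = True


def encode_extra(obj):
    """Fallback for json.dumps: turn datetimes into ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


text = json.dumps(event, default=encode_extra)
restored = json.loads(text)
# JSON hands back a plain string, so the reader has to know to parse it.
when = datetime.fromisoformat(restored["when"])
print(text, when)
```

The round trip works, but only because both sides agree out of band that `when` holds a timestamp, which is the gap a format with a native datetime type closes.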

27:05 Well, I was actually reading a different article. But the, it came up, we, I think we've talked about

27:14 friendly traceback. It's a package that just sort of tries to make your tracebacks nicer. But,

27:20 but I didn't realize it had a console built in. So I was pretty blown away by this. So there's a,

27:28 it's, you know, it's not trivial to get set up. It's not that terrible, but you,

27:31 you have to start your own console, start the REPL, import friendly_traceback, and then do

27:38 friendly_traceback.start_console(). But at that point, you have just like the normal console, but you have better

27:45 tracebacks. And then also you have all these different cool functions you can call like,

27:50 what, where, why, and explain, and more. And basically if something goes

27:58 wrong while you're playing with Python, you can interrogate it and ask like for more information.

28:04 And that's just pretty cool. The, the why is really great. So if you have the, one of the examples I saw

28:11 before, and I'm, I think I might start using this when teaching people is, we often have like

28:17 exceptions, like you assigned to none or you assigned to something that can't be assigned,

28:21 or you, you, you didn't match up the bracket and the parentheses or something like that correctly.

28:27 And you'll get, like, just syntax error and it'll point to the syntax error, but you might

28:32 not know more. So you can just type why, W-H-Y, with parentheses, because it's a, yeah,

28:39 because it's a function, and it'll tell you why. Why, it's like the great storytelling,

28:45 right? The five whys of a bug. Yeah. So then you get the W's of a bug. Yep. You can say

28:52 what, to repeat what the error was. Why will tell you why that was an error, and then

28:58 specifically what you did wrong. And then where will show you, if you've been asking

29:03 all sorts of questions and you lost where the actual traceback was, you can say where, and it'll point

29:08 directly to it. And I think this is going to be cool. I think I'll use this when trying to teach,

29:13 especially kids, but really just people new to Python. Tracebacks can be very helpful for them.

29:19 Yeah. Like even, I know, like I sometimes have to look up like certain error messages that I'm like,

29:24 not familiar with. So yeah, that would be super helpful. I could just do it right in the console.

29:28 Yeah. I totally agree. You're going to have to help me find a W that goes with this,

29:32 but I want the, what would be effectively google, open close parentheses?

29:40 You know, because so often you get this huge traceback and you've got these errors. And if

29:43 you go through and you select it, like for example, the error you see on the screen,

29:46 UnboundLocalError, local variable greetings, in quotes, referenced before assignment. Well,

29:52 the quotes mean, oftentimes in search, that it must have the word greetings. And that's the one thing that

29:57 is not relevant to the Googling of it. Right? So if I'm a beginner and I even try to Google

30:02 that I might get a really wrong message. Right? So if you could say, Google this in a way that is

30:08 most likely going to find the error, but without carrying through like variable details, file

30:14 name details, but just the essence of the error, that would be fantastic. Now, how do we say that with W?

30:21 You could just say, Whoa, or, or maybe www or WTF. I mean, come on, there's a lot of WTF.
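
The wished-for "search this error" helper could look roughly like the sketch below. This is not a friendly-traceback feature; `search_phrase` is a hypothetical function that crudely strips quoted names from the message, and the exact wording of the message varies between Python versions.

```python
import re


def search_phrase(exc: BaseException) -> str:
    """Build a search phrase from an exception, dropping quoted, code-specific names."""
    # Remove anything in single or double quotes, e.g. the variable name.
    message = re.sub(r"'[^']*'|\"[^\"]*\"", "", str(exc))
    # Collapse the double spaces left behind by the removal.
    message = " ".join(message.split())
    return f"python {type(exc).__name__} {message}"


def greet():
    text = greetings.upper()  # 'greetings' is local here because it is assigned below
    greetings = "hello"
    return text


try:
    greet()
except UnboundLocalError as exc:
    query = search_phrase(exc)
    print(query)
```

A real implementation would need more care, since messages containing apostrophes such as "can't" would confuse this simple pattern.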

30:31 But wouldn't that be great. And so that's also part of this package that you see,

30:36 at their main site where you've got these really cool, like visualized stuff, right? Where it's

30:42 sort of more tries to tell you the problem of the error with the help text and whatnot.

30:45 Yeah. Yeah. This is cool. Also uses rich, which is a cool library we talked about as well.

30:49 I love rich. I include rich in everything now, even just, just to print out simple,

30:54 better tables. It's great. Yeah, for sure. Hannah, do you see yourself using this or is it,

30:59 are you more in notebooks? Oh no. I mean, I usually use, like, the PDB debugger. So yeah,

31:07 I mean, I'm not sure if really this as it is would be, like a problem. It would depend on how

31:14 much information it has about, like, obscure errors from dependent libraries, which is usually what I

31:20 end up looking at these days. But yeah, I mean, conceivably, like, yeah, that could be helpful.

31:25 Yeah, if we get that WTF feature added, then yeah. Oh yeah, for sure. Gosh, speaking of errors, let's cover your last item, the last item of the show. Yeah, so I, at work,

31:39 work in the security org and I write, like, automation tools for them, which means

31:47 sometimes the repos that we work on get to be, like, test subjects for new, like, requirements and

31:55 such. And so recently our org was exploring, like, static code analysis, looking for, like,

32:04 security vulnerabilities in the code. And so I ran across Bandit and I integrated Bandit

32:09 into our... We don't have time to go through this old legacy code and fix these problems. Oh wait, this

32:15 is what it means? Oh, sorry, yes, we can do that right now. That's the kind of report you got from

32:21 Bandit? Yeah, exactly. So yeah, we integrated Bandit into our legacy code base, and actually it's funny

32:29 you say that, because the bug that I found using Bandit was actually, like, from the legacy code.

32:35 That does not surprise me. Yeah. So it was a pretty stupid, like, error. Like, it was pretty

32:44 obvious, like, if you were doing code review, but because it was legacy code and it was, like, already there,

32:49 I just, like, never noticed. But it was basically, like, issuing, like, a request with, like, no verify.
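
What "no verify" means can be shown with the standard library alone. The sketch below is a stand-in for the code being described, which the transcript does not show and which presumably used an HTTP client option. It builds one TLS context with certificate checks on and one with them switched off, without making any network request.

```python
import ssl

# Default context: the server certificate must validate and match the hostname.
verified = ssl.create_default_context()

# The risky shortcut: same context with both checks turned off.
unverified = ssl.create_default_context()
unverified.check_hostname = False  # must be disabled before verify_mode is relaxed
unverified.verify_mode = ssl.CERT_NONE

for label, ctx in (("verified", verified), ("unverified", unverified)):
    print(label, ctx.verify_mode.name, "check_hostname =", ctx.check_hostname)
```

For an internal service with a self-signed certificate, a safer route is usually to trust that certificate explicitly, for example with `SSLContext.load_verify_locations`, rather than turning verification off.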

32:55 So it was, like, an unverified, like, HTTP request. And Bandit was like... Yeah, this broken SSL

33:03 certificate keeps breaking it, I just told it to ignore it. Oh yeah, yeah. Well, and I honestly, like, I think that

33:09 might have been why it was there in the first place, because I know, like, the, oh, like, several years ago

33:14 like, had some certificate issues. So yeah, that might be. And it was, like, an internal talking to

33:20 internal, so it was, like, maybe even a self-signed certificate that nothing trusted, but they get

33:26 technically there. Yeah, yeah, it was like, we'll just do that. But yeah, so Bandit is

33:33 basically like a linter, but it looks for security issues. So you could just, like, pip install it

33:40 and then just, like, run it on your code, and it will find a bunch of different potential security issues

33:45 just by, like, statically analyzing your code. And I've pretty much, like, come to the opinion

33:52 that, like, why haven't I done this on all of my other projects? Like, I should be doing this on every single

33:58 project. Because, you know, as, like, a developer, I always run, like, lint and Black and stuff

34:05 like that. So I figure, you know, I should probably be running Bandit too. Yeah, cool. Yeah, well, very nice.

34:12 It's a good recommendation for people as well. And it's got a lot of cool... you can go and actually see

34:16 the list of the things that it tests for, and it even has test plugins as well, which is pretty cool. Yeah.

34:21 Yeah, so you can, like, make your own if you want. And it has, like, all the common linter sort of

34:27 functionality, like ignore these files, or ignore these rules, or even, you know, ignore this

34:32 rule on this particular line, stuff like that. Yeah, absolutely, which is pretty sweet. I love that things like

34:37 Bandit are around because, thankfully, developing web stuff is becoming easier and easier,

34:45 but now that the barrier to entry is lower, you still have to have all those security concerns

34:52 that you had before. I mean, usually people just had more experience, but they would make

34:58 mistakes anyway. But now, I think this is one of the reasons why I love this: people new to it

35:03 might be terrified about the security part, but having Bandit on there looking over their shoulder is

35:08 great. Yeah, yeah, like don't publish with the debug setting on in Flask or Django or anything like that,

35:14 simple, obvious stuff. And, like, honestly, having worked in the security org for about a year now,

35:21 I've come to the understanding that a lot of security issues stem from just, like, basic, like, duh sort of

35:29 misconfigurations. So something like this is perfect. And I really, really like that you added,

35:36 you wrote in the show notes, how to hook this up with pre-commit, because

35:42 I think having it in pre-commit or in a CI pipeline is important because, like you guys were joking

35:48 about, often security problems come in because somebody's just trying to fix something that broke. Yeah. But

35:54 they don't really realize how many other things it affects. So yeah. Yeah, yeah. The site's down, we've just got

36:00 to make it work quick. Just turn on the debug thing, we'll just look real quick, and then you forget

36:03 to turn it off or whatever. Yeah, yeah, for sure. Yeah, just stupid human errors. Nice. All right, I want to go

36:11 back real quick, Brian, because your mention of friendly traceback got a lot of reactions, so let me just do a

36:17 quick audience reaction. Robert says it is cool, Brian. John Sheehan says, I was just thinking something

36:23 the same would be cool, it's a great teaching concept. Anthony says super useful. John says, I've been doing

36:28 more demo code in the console rather than the IDE, this looks like it would help. W, how to fix it? W,

36:36 wow, how. I love it, Robert, very good. Zach says, what is this magic? This looks amazing. And so on. All

36:44 right, well, thanks everyone. I'm glad you all like that. So that's it for our main items.

36:50 Brian, you got any extras you want to throw out there? You were doing something with climate change, or what

36:55 are you doing this week? Yeah, I'm sharing the room with some people, just a sec. I did do

37:02 two meetups, with Noah and then with the Aberdeen Python meetup. Wait, wait, I got,

37:09 I got to interrupt you really quick. Did all the talk that Hannah did about Bandit and viruses get you...?

37:14 It's all right. I'm sorry, sorry about that, carry on. Well, I missed... All this talk that Hannah

37:25 had about viruses and hacking and stuff with Bandit, did it make you nervous, and you had to put on

37:31 your mask? No, I'm just in a group meeting in their group room and somebody came in, but it's okay. I'm

37:37 just teasing, carry on. That's funny. I also wanted to look like a bandit. Yeah, exactly. But I was

37:44 thrilled that Noah asked me to speak to them, that was neat. And then the Python Aberdeen people,

37:51 and also, they mentioned, Ian from the Python Aberdeen group said that he had an arrangement

37:57 with you, Michael, that when the pandemic is over you're going to go over and they're

38:02 going to, you're going to do, like, a whiskey tour or something like that. So I don't know the

38:06 details, but it sounds good to me already. Anyway, if that happens, I want to go along. Yeah, it's a Python

38:12 Bytes outing, let's do it. And then there's the PDX West meetup tomorrow, you're going to speak.

38:19 That's kind of exciting. So yeah, it's going to be fun. And it's virtual, so people can attend from wherever.

38:23 I've also got feedback from you, and Matt Harrison gave me some feedback, so I'm

38:31 updating my training page on testing code, because I really like working with teams. So if

38:37 anybody else wants to give me feedback on my training page, I'd love to hear it. So yeah. Or maybe

38:43 they even want to have some pytest training for their team. Yeah, I mean, testing is something that,

38:48 I think, teaching a team at a time is a great thing, because people can really, I don't know,

38:53 we can talk about their particular problems, not general problems. It's good. So yeah. For sure. Well,

38:59 you also need more of a team buy-in on testing, right? Because, like, if one person writes code and

39:03 won't write the tests, and another person is, like, really concerned about making the tests fast, it's super

39:07 frustrating. Yeah, the person who doesn't want to run the tests keeps breaking the build. But, like, you know,

39:12 anyway, it's a team sort of sport in that regard. Yep. Yeah. All right, awesome. So I got a couple quick

39:17 things. PEP 634, structural pattern matching in Python, has been accepted for Python 3.10. That's like,

39:24 imagine a switch case that has about a hundred different options. That's what it is. Yeah, with,

39:29 like, regex, not quite, but sort of like that style. Like, you can have these patterns and stuff that

39:34 happen in the cases. I don't know how to feel about this. Let me put it in perspective: if the

39:40 walrus operator was controversial, this is like a way bigger change to the language. So

39:45 I don't know. It's both awesome and terrifying. Yes, exactly. Yeah, I was going to say, I'm kind of surprised.

39:51 Yeah, so am I, Hannah, that this got accepted. It seemed to be sort of counter to the simplicity of

39:56 Python. Like, I'm not at all against having a simple switch statement that does certain things, but

40:01 this seems like a lot. I may come to love it. One thing that maybe would help me come to a better

40:05 understanding and acceptance was if the PEP page had at least one example of it in use. Like, the

40:10 whole page that talks about all the details, I don't believe there's a single code sample, ever.

40:15 Well, there's a tutorial page as well. So... Oh, is there? There's the tutorial page. Okay, maybe that's where

40:20 I should be going to check it out. Yeah, but it still sort of feels like a five-barrel foot gun. Yeah,

40:25 it does. Well, but the page that I'm looking at, like the PEP that I'm linking to, the official PEP, I

40:29 don't think it has... does it have the tutorial? Yeah. No, you're right, it does, it does, somewhere down...

40:35 Yeah, PEP 636. Yeah, it's a different PEP that is the tutorial for the PEP. Interesting, I didn't realize

40:40 that. It's kind of meta, honestly. Anyway, to me, I'm a little surprised it's accepted. Fine. I know people

40:46 worked really hard on it, and congratulations, a lot of people really want it. It comes from Haskell, right? So

40:50 Haskell had this, like, pattern matching, like, alternate struct thing. I don't know, I just feel

40:53 like Haskell and Python are far away from each other. So that's my first impression. I will

40:58 probably come to love it at some point. PyCon registration is open, so if you want to go to PyCon,

41:03 you want to attend and be more part of it than just, like, watching the live stream on YouTube, be part of

41:07 that. I think I'm going to try to make a conscious effort to attend the virtual conference, not just

41:11 catch some videos. So you can do that. Yeah, PyCon is awesome. Like, my first conference was PyCon,

41:18 and then I went to other conferences and I was like, what is wrong with these conferences? Like...

41:23 Yeah, I know, I feel the same way. I know, it's really, really special. I'm sure the virtual one

41:31 will be good. I can't wait for the in-person stuff to come back, because it really... For sure. Yeah, it's a

41:36 whole other experience in person. I consider it basically my geek holiday, where I get away

41:41 and, like, just get to hang out with my geek friends. I happen to learn stuff there. Totally.

41:44 And then Python Web Conf is coming up, and registration is open for that as well. And I

41:51 suppose probably PyCascades, which Brian and I are on a panel at there as well. Oh, nice. I put a link

41:56 into an Hour of Code for Minecraft, which has to do with programming Minecraft with Python, if people are

42:01 looking to teach kids stuff. That looks pretty neat. So my daughter's super into Minecraft. I don't

42:06 do anything with it, but if you are and you want to make it part of your curriculum, that's pretty

42:10 cool. Hannah, anything you want to throw out there before we break out the joke? Nope, I'm good.

42:15 Awesome. Do it, do it. All right, all right. So this one, we have something a little more interactive for

42:20 everyone. We've got a song about PEP 8, about writing clean code. This is written and produced and

42:28 sung by Leon Sandoy, goes by Lemon, and him and his team over at Python Discord. He runs Python Discord, and

42:34 apparently it was a team effort creating this. And the reason I'm covering it is a bunch of people sent it

42:38 over. So Michael Rogers Valet sent it over, said you should cover this. Dan Bader said check this out.

42:43 Alan McElroy said, hey, check out this thing. So, all right, I actually spoke to Lemon and said, hey, do

42:49 you mind if we play this? He said, no, that'd be awesome, give us a shout out. Of course. So we're

42:53 going to actually play the song as part of this. In the live stream, you get the video; on the audio, you get,

42:57 well, audio. So I'm going to kick this off and we'll come back, and I'd love to hear Brian and Hannah's thoughts.

43:02 Here we go. You don't need any curly braces, just four spaces, just four spaces.

43:26 Wildcard imports should be avoided in most cases, in most cases. Try to make sure there's no trailing whitespace, it's confusing, it's confusing.

43:47 Trailing commas go behind list items, get blamed titans, get blamed titans.

43:57 And comments are important, as long as they're maintained. When comments are misleading, it will drive people insane.

44:09 Just try to be empathic, just try to be empathic, just try to be a friend. It's really not that hard, just adhere to

44:22 PEP 8. PEP 8.

44:33 Constants should be named all capital letters, and live forever, live forever.

44:44 And camel case is not for python, never ever, never ever.

44:55 And never use a bare exception, be specific, be specific.

45:06 No one likes the horizontal scroll bar, keep it succinct, keep it succinct.

45:17 And comments are important, as long as they're maintained.

45:23 When comments are misleading, it will drive people insane.

45:29 Just try to be empathic, just try to be a friend.

45:34 It's really not that hard, just adhere to

45:40 PEP 8.

45:44 PEP 8.

45:50 PEP 8.

45:55 PEP 8.

46:02 PEP 8.

46:04 PEP 8.

46:08 PEP 8.

46:08 That was amazing.

46:09 I can sympathize with so much of what he's saying.

46:14 I'm just having flashbacks to a discussion I had with my teammate about comments.

46:19 And being like, "No, this comment doesn't actually describe what the code is doing."

46:27 It's worse than having no comment. It really is.

46:30 It really is, yeah.

46:31 I love it.

46:32 Or if it describes literally what the code is doing and not high-level.

46:36 Exactly.

46:37 Why or background or anything other than...

46:40 The why.

46:41 The why is important.

46:42 Yeah.

46:43 I love it.

46:44 So, two things.

46:45 Lemon and team, well done on the song.

46:47 And man, you've got a great voice.

46:48 That's actually...

46:49 It was beautiful and funny.

46:51 Yeah.

46:52 It was amazing.

46:53 All right.

46:53 Well, Brian, we probably should wrap it up.

46:54 Yeah.

46:55 All right.

46:56 Well, Hannah, thanks so much for being here.

46:58 It's good to have you on the show.

46:59 And Brian, thanks as always.

47:00 Everyone, thanks for listening.

47:01 Thanks for having me.

47:02 Bye.

47:03 Bye, all.

47:04 Thank you for listening to Python Bytes.

47:05 Follow the show on Twitter via @pythonbytes.

47:07 That's Python Bytes as in B-Y-T-E-S.

47:10 And get the full show notes at pythonbytes.fm.

47:13 If you have a news item you want featured, just visit pythonbytes.fm and send it our way.

47:17 We're always on the lookout for sharing something cool.

47:20 On behalf of myself and Brian Okken, this is Michael Kennedy.

47:23 Thank you for listening and sharing this podcast with your friends and colleagues.
