Transcript #47: PyPy now works with way more C-extensions and parking your package safely
Return to episode page view on github00:00 Michael KENNEDY: Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is Episode #47, recorded October 11th, 2017. I’m Michael Kennedy.
00:00 Brian OKKEN: And I’m Brian Okken.
00:00 KENNEDY: We’ve got a bunch of cool stuff lined up for you. Hey, Brian. How’s it going?
00:00 OKKEN: It’s going really good.
00:00 KENNEDY: Great. Before we get to your first item, I want to say thanks to DIgitalOcean. They sponsored a bunch of episodes coming up and they’re really supporting the show. The thing they want me to tell you about is Spaces, which is like Amazon S3 but three times better and you get a two month trial. So, check it out at do.co/Python and we’ll talk more about that later.
00:00 How about fast Python, Brian? What do you think?
00:00 OKKEN: I’m excited. So, PyPy is a fast implementation. It’s good to see that there is still working coming out. One of the exciting bits of news just recently is version 5.9, at least on the PyPy 2.7 version of this release, has Pandas and NumPy in it as well, which is super exciting.
00:00 KENNEDY: That’s actually a really big deal because they had not been supported. That’s one of the things that was a challenge with PyPy. It was great, it was much faster and in many ways it was five times faster than regular CPython. However, it didn’t support any of the C extensions, you could integrate things like NumPy and stuff. So, it was like you get a subset of Python that’s super fast but there might be things you don’t want to do and oh, by the way, a lot those things are computational, where people care about when it’s fast.
00:00 So, it’s awesome to see that coming on.
00:00 OKKEN: So, getting NumPy and Pandas come on, and I’m sure that eventually they’ll come on to the 3.5 branch as well.
00:00 KENNEDY: Yeah, for sure. I also have notes about Cython as well.
00:00 OKKEN: Yeah, so it includes Cython 0.2.7.1, which supports a lot more Cython projects on PyPy. I’m not sure what the Cython story was for this release but that’s pretty exciting.
00:00 KENNEDY: Yeah, that’s cool. I think the biggest news here is that CFFI (C Foreign Function Interface) has been updated and the C API extensions for many projects, now work with PyPy, whereas previously they did not. So, it’s not just Pandas and NumPy, those are the headline ones. There's a bunch of things that previously couldn’t work with PyPy because of the C extensions, but guess what? Now they can. That’s pretty awesome.
00:00 OKKEN: Yeah, another bit of news with this release is the optimized JSON parser for both memory and speed, which should help for people trying to pull in JSON. It’s good.
00:00 KENNEDY: Yeah, that’s awesome. I think people use JSON every now and then, not really sure. (Laughs) ALl the micro services, it’s like the network lights are JSON messages.
00:00 So, that’s really cool and that’s all pretty straightforward. I want to show you some stuff that is not straightforward.
00:00 So, there’s this project on GitHub that has really taken off. There’s a ton of people contributing to it. There’s 17 contributors who are doing a lot of work on this project – and it has about 3,600 stars – called “WTF Python?”
00:00 If you’ve seen the “WAT?” Video about JavaScript and Ruby, which is hilarious. Python is lucky in that there’s not that many weird edge cases, but this repository will show you that actually, there’s some weird cases. Have you seen this, Brian?
00:00 OKKEN: No, I haven’t. This is pretty funny.
00:00 KENNEDY: I pulled out four items but there’s a bunch. This is super active on GitHub, I’m getting all these notifications from it. It’s cool. One is about skipping lines. You say, ‘Value = 11. Value = 32. What is value? It’s 11.’ Huh? What is going on here. There’s another one that’s similar in the same section. ‘(‘E’) false.’ And things like that. It’s about encoding and some interesting stuff.
00:00 So, each one of these has a really simple three or four lines of code and then the explanation. The explanation, I think, is where this gets interesting. So, another one is modifying dictionaries. These are super good ways to trick people. ‘Create a dictionary with one item. Go through for each item in it. Delete that item and add a new one, and then print that out.’ How many times did that loop around, do you think?
00:00 OKKEN: I have no idea.
00:00 KENNEDY: It’s either one or error or something, I would guess, but the answer is eight. Exactly eight. Why is it eight? Why doesn’t it run one infinite or zero or error. Those three. Zero, one or infinity. Eight doesn’t make any sense. If you look at the implementation, the dictionaries are pretty allocated because you’re typically adding stuff. They want to grow in like a doubling sort of way. Not a, ‘Every time you add something it’s got to reallocate and copy around things.’ So, what they do is they pre-allocate a certain number of items and this trick leverages signing into those slots until it runs out. So, this is crazy.
00:00 I’ll give you one more example. ‘Is is not what it is.’ So, if you say, ‘A = 256, B = 256, A is B is true.’ However, if you say, ‘A is 257 and B is 257, A is B is false.’ Do you know why? It’s another crazy one. This is insane. The reason is, I believe, the first 126 numbers – may be negative as well, I’m not sure – are pre-allocated for performance reasons. Every time you literally say the number seven, that points to this pre-allocated flywheel pattern-type thing. But beyond that, these get allocated on demand, so you’re basically asking, ‘Is it the pointer to 257 equal to the other pointer 257?’ And there’s no longer this tracking between them and they get dropped.
00:00 So, there’s tons of this craziness going on here. It’s pretty fun.
00:00 OKKEN: Yeah, that’s nice.
00:00 KENNEDY: So, I think that this is a fun project. I really commend the people working in it. It’s great. I want to do something with this later, I just haven’t figured out quite what the details are yet but there’s gotta be something fun here.
00:00 So, this makes me feel like I should go practice my Python, like maybe I’m not as good as I thought I was. That dictionary thing going eight times took me for a loop for a bit.
00:00 OKKEN: Anything in the “WTF Python?” would be evil to try to bring up at a job interview.
00:00 KENNEDY: (Laughs) It’d be very evil. But if they answered it, think of that.
00:00 OKKEN: Oh, yeah. That’d be good.
00:00 I ran across this recent article called, “Python Exercises.” I’ve done this before, trying to brush up on Python skills or trying to find some questions to ask at an interview or something, trying to come up with some decent questions. A lot of the questions out there, they seem to be sort of generic questions around any language and they just happen to do it in Python. This is a collection of questions. Some of them are pretty easy to start off with, like basic syntax stuff. But there are some things that check just Python and some use of the standard library.
00:00 I think it’s a nice collection. It goes through syntax, of course, and some text processing and OS integration, decorators, generators. You can get into quite a few things. I think it’s a nice set, it’s not too huge. It’s a good one to look at.
00:00 KENNEDY: Yeah, and they don’t seem too trivial. They’re like given this set of data, parse it, go to the CSV file, process. Things like that. It’s pretty nice actually.
00:00 OKKEN: Yeah, and then the at the end, the last thing they talk about is testing, which I very much appreciate. I think it’s important. I’ve started to send out code examples before I bring somebody in for an interview, ask them to solve some coding problem but also to write a test to prove it works. I think that’s a good thing to add.
00:00 KENNEDY: Absolutely. That’s really cool that they include that at the end as well.
00:00 So, I’ve got another thing you should test for. Before I tell you about it, I want to tell you about Spaces.
00:00 Spaces is DigitalOcean’s new service which lets you basically store files on the internet and either privately or publicly pass them around. Kind of like Amazon S3 but much more affordable. Instead of charging you nine cents per gigabyte, they charge you one cent. You can use exactly the same tools. I use Transmit for my Mac; I love that to manage all my stuff in the Cloud. When I switched to DigitalOcean Spaces, which I did because I saw the offer, I’m like, ‘This is so much better’ before we even talked about this. I just pointed my Transmit at that. It just kept on working and it said, ‘Hey, there’s an S3 thing over here and here’s the key.’ If you are using S3 or some other sort of Cloud storage for file and things like that, you definitely should check out DigitalOcean Spaces at do.co/python and check it out. There’s a two month free trial then it’s really affordable and straightforward. I love it.
00:00 OKKEN: Nice.
00:00 KENNEDY: The audio you are listening to right now came straight out of there. Beautiful.
00:00 Have you heard of pickle?
00:00 OKKEN: Oh, yeah.
00:00 KENNEDY: Not the gherkins, but the built-in way to serialize stuff.
00:00 OKKEN: I don’t remember why, but I try to avoid it because I heard there’s problems.
00:00 KENNEDY: Yeah, there’s two major problems with pickle. One of them is it stores a binary representation of your objects. So, if you do things like rename a field maybe even reorder stuff, if you add a field, remove a field… There’s all sorts of stuff where just the versioning of your classes or your data, if that changes, you can no longer properly serialize these things. It’s not great. That can be a problem and that’s probably reason enough to use JSON or some other format. However, right in the documentation it says:
00:00 “Warning: The pickle module is not intended to be secure against erroneous or maliciously-constructed data. Never unpickle data received from an untrusted or unauthenticated source.”
00:00 I think people see this and they’re like, ‘Okay, that looks bad. Let’s get out of here.’ And they just bail. As they should; I think the versioning stuff alone is already an issue. I think there was an issue with somebody caching stuff and when they were switching from Python 2 to Python 3, the in-memory representation of daytime, or some part of the memory was a different representation and the picking stuff started to conflict with each other.
00:00 This article I want to talk about it is called, “Exploiting Misuse of Python’s Pickle.” If you’ve ever read that warning, and gone, ‘Huh, that sounds bad. I can kind of imagine what that might look like. I’m going to stay away from it.’ This shows you exactly how to do bad things.
00:00 Bad things begin with, ‘Let’s create a remote shell and start executing code and maybe even let us log in remotely over SSH to this machine by sending a little bit of binary data,’ like 50 bytes – something super small – over to this machine. I’ll just login and go from there.’ That sounds bad, right?
00:00 OKKEN: Yeah, jeez.
00:00 KENNEDY: When you unpickle something, there’s a few hooks where you can run arbitrary Python code. So, they say, ‘Well, let’s use subprocess.popen and create a chill for us.’ So, you just put that command in (dunder)_reduce_ I think it’s called and then you’ve got shells.
00:00 For those of you out there wondering, ‘What is this warning about? Exactly why should I be super scared?’ Here’s why. Great little example. Super approachable.
00:00 OKKEN: Wacky.
00:00 KENNEDY: Yeah, wacky. If I was running a Django website, I probably wouldn’t want to use that as my exchange format on my services, right?
00:00 OKKEN: No, and there are so many other, better formats anyway.
00:00 KENNEDY: JSON.
00:00 OKKEN: JSON.
00:00 KENNEDY: For sure. So, what do you have next for us?
00:00 OKKEN: I’ve got “A Complete Beginner’s Guide to Django.”
00:00 KENNEDY: Awesome.
00:00 OKKEN: This is a seven part series and it looks like six parts are done already and the seventh part is coming up soon it kind of goes through quite a bit of Django. I know there’s already a lot of Django tutorials out there. But the interesting thing I think that makes this one stand out is it has an academic feel to it, I think. If that’s your thing then you might like this.
00:00 KENNEDY: Well, it has a chalkboard, it has a beaker and it has a Superman flying, so these are all good signs. Yeah, well it has some comic drawings in it too and stuff. Actually, I think this is really nice. The graphics are wonderful. They’ve got little wire frames to help you design the web pieces, some nice graphics for file structure. It seems super approachable to me.
00:00 OKKEN: I kind of got lost with some of the UML diagrams and whatnot. It’s well written and people should check it out if they want to learn Django.
00:00 KENNEDY: Absolutely and it’s based on Python, not Legacy Python, so this is all good as well. If you’re looking to pick up Django, this is a good place to do it.
00:00 Do you remember when we talked about the malicious packages being uploaded to PyPI?
00:00 OKKEN: Yes.
00:00 KENNEDY: Do you remember what they were targeting? How were they getting people to install them?
00:00 OKKEN: Well, there were a couple ways. There were naming standard library things in PyPI and also misspellings.
00:00 KENNEDY: Exactly. So, we have a new GitHub project called, “pypi-parker.” This is a cool by a guy named Matt. He sent this over and said, ‘Hey, you should check this out.’ I don’t think a lot of people know about it yet but it’s really cool. The idea is that we had this debate about, ‘How do people check and verify what gets uploaded to PyPI?’ Should there be a committee that reviews it? All that sounded really bad. So, he’s created this library that says, ‘Look, the self-serve ability of people to just upload things to PyPI, this is a good thing. Let’s not get rid of it. Let’s just try to solve this typosquatting problem.’ So, what he’s done is he’s created this thing called the pypi-parker and it’s an extension to distutils. So, it’s a separate command that you can on it. If I was like Kenneth Reitz and I create a request, you do this and I could run a setup.py and give it, I think it’s park, and it will actually generate additional packages that I can upload to PyPI. And they’ll be the various reasonable misspellings of requests. When you import them it will raise an error in import error and says, ‘No, this thing that you pip installed? You misspelled that. Go get the real one over here.’ So, it gives them like a help message and all that kind of stuff.
00:00 It gives the ownership of these misspellings to the original package owner, and then for the people trying to accidentally use those, it will give them the warning to say, ‘You’ve misspelled this but here's what you actually should be looking for.’ I think that’s great.
00:00 OKKEN: Yeah. That’s cool.
00:00 KENNEDY: Well done, Matt. If you’re a package owner, check this out. It might be helpful.
00:00 OKKEN: Since I’m not writing so much anymore, I’m thinking about writing a couple new Open Source projects, so I’ll probably be in that boat soon.
00:00 KENNEDY: Nice. So, you should use pypi-parker and then give us a report. Awesome.
00:00 That’s our six items for the week, so hopefully everyone enjoyed them. Brian, what else is going on?
00:00 OKKEN: Well, I’m just getting ready for Halloween, actually.
00:00 KENNEDY: I know. Houses around here are getting scary. A lot of creatures and various cobwebs.
00:00 OKKEN: I have not been as busy as you have lately. What have you been up to?
00:00 KENNEDY: I have just released a brand new course and you can find it at freemongodbcourse.com. That should give you pretty much all you need to know about it.
00:00 So, I have this paid course, which is like a seven hour, super in-depth thing and I want to come up with a way for people to get started with Python, get started with MongoDB. Then if you want to learn more than you can take the paid course, or things like that. So, just drop over at freemongodbcourse.com and sign up. There’s really no strings attached, you just have to create and account and you can go take the class.
00:00 Another thing I wanted to point out, this is maybe not worth a whole item, this is not my thing it’s just something I saw. Donald Stuft who runs PyPI the website and all that kind of stuff, He sent out a tweet that said Python 3 usage has doubled in the past year according to download stats on PyPI.
00:00 OKKEN: That’s cool.
00:00 KENNEDY: Yeah. Legacy Python is definitely on the downward trend, even though it’s still the majority of things that get downloaded. Way to go, Donald for putting that out there. Nice to see that trend continuing.
00:00 Thank you everyone for listening. Brian, thanks for finding these things and sharing with everyone.
00:00 OKKEN: Thanks.
00:00 KENNEDY: Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes. Get the full show notes @pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We’re always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.