Brought to you by Michael and Brian - take a Talk Python course or get Brian's pytest book


« Return to show page

Transcript for Episode #108:
Spilled data? Call the PyJanitor

Recorded on Monday, Dec 10, 2018.

00:00 KENNEDY: Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is Episode 108 recorded December 10, 2018. I'm Michael Kennedy.

00:12 OKKEN: And I'm Brian Okken.

00:13 KENNEDY: And this episode is brought to you by DigitalOcean. Check them out at pythonbytes.fm/digitalocean. Tell you more about that later. Right now I want to know how you doing, Brian?

00:20 OKKEN: I'm doing great.

00:21 KENNEDY: Yeah? Do you find that sometimes you end up with messy data, so you got to clean it up? You're like oh gosh, not again.

00:30 OKKEN: Yeah, I've got, you know, empty spots and bad stuff. Yeah, anyway, data.

00:36 KENNEDY: Yeah, somebody spilled a stray word where a number was supposed to go. Maybe there is something you can do.

00:40 OKKEN: Yeah, you can get a janitor.

00:42 KENNEDY: A pyjanitor.

00:43 OKKEN: Clean up your janitor, yeah. First project we're going to talk about is pyjanitor. It's a package for cleaning up your data. And this has a history in, originally it was a port of an R package called Janitor, but now it's grown from that. It's both for cleanliness of data, but also just a really clean interface in convenient routines. So, it's kind of a, for anybody working with data that can be, can have problems with it. There's a whole bunch of stuff involved with this. So, some of the functionality includes cleaning up column names, which I'm not sure why you would have bad column names in the first place, but if you're pulling it from somewhere else.

01:24 KENNEDY: Yeah, like a lot of times people load CSVs into pandas DataFrames and things like that, I think.

01:29 OKKEN: Yeah, okay. So, cleaning those out. Removing empty rows and columns and identifying duplicate entries. There's some stuff that can just happen with data like that, but it has a whole bunch of other cool things like telling your system how to deal with empty values, and expanding columns, and coalescing multiple columns into a same column, and a whole bunch of stuff like that. That's part of it, is dealing with messy data, but the other thing is to try to keep your code clean. So, on the other side of it it has a more of a functional programming style of using it, and not going to try to really talk about this too much, but we have, a code snip it in our show notes where it shows kind of how you would deal with DataFrames and doing things like dropping columns and stuff within pandas, and then how that would look in pyjanitor code. And it just makes it a lot, I think it's more maintainable.

02:21 KENNEDY: Sweet. Yeah, yeah. I really like this. It looks super handy. It's like a set of utilities on top of pandas which is great. And I like how they described this feature you just talked about. It's like a cleaner, method chaining, verb-based API for common pandas routines, otherwise known as a fluent interface. If you're looking for one word there. But, that's great. So, yeah, I think it's a really nice way to work with it. It looks really approachable, and I do like the fluent interface, like I really, really wish more things like that operated that way.

02:53 OKKEN: Yeah, for instance, a lot of the functions return DataFrames, so that you can just keep on in the new function to do multiple stages of a workflow.

03:02 KENNEDY: Now, this is all around pandas, so if your doing not just regular data cleaning, but you got to be working with pandas. But I think a lot of people who do that kind of work probably are, so it seems really helpful. So, would you consider yourself an expert at Python?

03:15 OKKEN: No, I know enough to know that, what I don't know.

03:17 KENNEDY: Yeah, I think this is one of the things where it's like your never really ... There's always stuff that you don't know, so you don't ever necessarily feel like an expert.

03:25 OKKEN: Well, especially in this field, when we're like, I'm always researching stuff that know more about, more than me about cool places. But, all in all I think I am, yeah.

03:33 KENNEDY: I would say that you are. So, I think there's a really interesting presentation done by James Powell. This was recommended to us by one of our listeners. I just don't remember who, so I can't give them credit. But thank you, for whoever sent this in. And this is a presentation at PyData 2017 by James Powell, and it's a YouTube video. And it's not just a little light here's my quick rundown of the five things. It's like an hour and a half sort of deep dive into what it takes to be an expert. So, basically James says, Hey, look, it's pretty easy to be competent with Python. Right? You can learn the couple of things and whatever other programming language you use, you can kind of make Python do that. But, to really understand it properly and take full advantage of it, write Pythonic code, things like that, is a whole lot harder. So, he runs through some of the things that he thinks people should know, and it's really focused at the maybe advanced beginner/early intermediate type of developer who can do stuff with Python but maybe stopped learning about the language and the features when they got whatever they were trying to make work, work.

04:40 OKKEN: Oh, good.

04:41 KENNEDY: Yeah, so it covers things like the Python data model, you know otherwise known as the magic methods or dunder methods, Metaclasses, a bunch of other stuff, and it's really nicely done. You know, I'm not a fan of presentations that are like, hey, here's seven slides and me talking about it, like woo hoo. You know, he just fires up an editor and says, "I have no slides. The editor is going to be the presentation. Let's start talking about these things." And just does it from scratch, which I think is a real genuine way to do it. So, well done.

05:07 OKKEN: I'm going to have to check that out myself.

05:08 KENNEDY: Yeah, yeah. I watched some of it, I haven't watched all of it. I've only watched like maybe half, but definitely watched enough to recommend it to folks. If you feel like your in this stage, like "Am I an expert? I'm not sure." Watch this and you'll probably get some things reinforced and others maybe be like, yeah, I knew that, that's great. Yeah. I would even say it was awesome.

05:27 OKKEN: That's very awesome. There's a lot of awesomeness in Python, and there's quite a few, there's quite a few different awesome lists. And so, that's what I wanted to talk about today, is another awesome list. And this one is called the Isosceles triangles. No, it's called Awesome Python Applications. It's kind of just a way for you to try to highlight a bunch of different cool applications, because if you're looking for packages to base your own project on, you can look at PyPI. But, that's not as easy to do with applications 'cause they don't exist in PyPI, so that's one of the, why this has been created. There's quite a few categories already and Mahmoud Hashemi has started it and he wants people to help him out, fill this in, because it's kind of hard to find applications sometimes. So, these are all applications written in Python that are open-sourced, that you can look at how they're doing things.

06:23 KENNEDY: Yeah, I really like this because so often people say, well, I'd like to use Python for this project but to sell this to my teammates and my manager, or my company, it would be great to say, well, YouTube is written in Python and I know you think Python doesn't scale, but I'm sure we're doing less than a million requests a second, so we'll probably be okay, also. Yeah, having the examples for those kinds of comparisons are really great, so this is a little bit like that. There's a bunch of stuff for like biology, you know, and like cell profilers and things like that, right?

06:55 OKKEN: Yeah, and even like I had to look this one up. ERPs, Enterprise Resource Planning. No, I don't need one of those, but cool, it's there. But a lot of these, one of the things I like about this is there's a lot of custom applications that people end up writing and they know that their problem space is very specific. And instead of writing everything from scratch, you could take one of these open-source projects and fork it or customize it for your own need. I mean, that's one of the benefits of open-source, of course, but good starting point.

07:26 KENNEDY: Yeah, that's awesome. I really like this. Again, well done, Mahmoud. And it's just cool to have these examples out there, you know?

07:33 OKKEN: Yeah, do you know what else is cool?

07:34 KENNEDY: I do and if I had a cool application, I would like to put my application there in DigitalOcean. Absolutely. So, DigitalOcean is sponsoring the show. They've been sponsoring most the episodes of Python Bytes and they're big supporters of it, so thank you to them. We use them for some of our infrastructure and it's working out great. One thing I want to highlight this time around is their early access Kubernetes project. So, if you're doing anything with Docker and Kubernetes and things like that, they have some special tools for deploying and managing your containers in the Cloud. So, just go over to pythonbytes.fm/digitalocean, you can sign up there or go over to the products and just pick Kubernetes and get started on that. There's tons of other stuff that you could do as well, but Kubernetes work they're doing is quite cool.

08:23 OKKEN: Very cool, nice.

08:24 KENNEDY: Indeed. So, the next one I want to talk about is something we haven't covered a ton on the show, but I think has some interesting shadows and parallels with Python itself. That is some governance around Django itself. So, there's an article called Django: Core no more by James Bennett. So, this is not the core as in some library, this is core as in core developers. So, Django's been around for a long time, 2005 onward, I believe. It's obviously a very polished, professional web framework, one of the most, if not the most. You know, it's in there fighting with Flask for that title, but one of the most popular Python frameworks, lots of amazing apps are built on it. But what they're finding is they actually, Django as a open-source project is not recruiting enough active contributors. That's surprising, right?

09:17 OKKEN: Yeah, it is.

09:18 KENNEDY: They say one of the reasons they think this is not working so well is they feel like there's these people called Django core developers, and then there's everyone else. And if you're not a core developer, well, you probably don't have any business messing around with Django or submitting any fixes or anything. Maybe you'll tell a core developer and they can go do it. Right? But not do it themselves. So, the proposal in summary is more or less to abolish this concept of a core developer all together. Okay. So, that when people come to look at Django they don't go, oh, there's a special group of selected core developers and then everyone else. So, instead what they found was in practice these core developers all had this straight commit bit. They could just commit straight to the repo and to have stuff happen, but no one was doing that. They were all creating pull requests and having a conversation around their changes anyway. And that's how somebody would make a contribution to Django from the outside. So, they said let's have a more democratic, not democratic, a more spread out, even way of talking about people who contribute to Django so that people are more likely to come and make contributions.

10:30 OKKEN: Okay. So, they still have, they'll still have some sort of process for deciding on which pull requests to do, right?

10:36 KENNEDY: Yeah, so now they're going to have two different groups of people who are formalizing the stuff. There's mergers and releasers, who would, respectively, merge PRs and then package it up and release it. So, these are more bureaucratic roles like sort of finalizing it, right? But the idea is to have PRs and open discussions around issues and PRs, and then these folks kind of say, yeah, okay, we're all good with this.

11:00 OKKEN: Interesting.

11:02 KENNEDY: So, it's a little bit of a parallel of Guido stepping back and saying, "Okay, everybody, you guys got to spread out some of this decision-making and not just, you know, leave it all on my back."

11:11 OKKEN: Yeah, I like what they're doing. I also like doing a lot of this stuff in the open and having the governance models be sort of an open discussion, so that different groups can learn from it. Like, for instance, I was listening to your interview about Sanic and they're talking about basically still figuring out how to govern the Sanic project. And so, doing all this in the open and having everybody be able to give feedback and stuff, it's cool.

11:38 KENNEDY: Yeah, it's definitely cool. Now, I don't believe this is the way that things are. This is a proposal for the way that James wants this to be. So, kind of take it in that sense, right? This is not an official decision as far as I know, but this is the proposal.

11:52 OKKEN: Okay, neat.

11:53 KENNEDY: Yeah, cool. Speaking of Django, what do you got? What's next?

11:56 OKKEN: Yeah, well, to shoo this in. I think somebody mentioned this on Twitter, again, I can't remember who. Sorry, but thanks for everybody for giving us tips on things. There is a Django template that is called the wemake-django-template. I think, actually, I don't really know who wemake is. I think wemake is a group that does like customer websites and stuff.

12:18 KENNEDY: Let me just say really quick. This is not templates as in Django templates, or Jinja2, or Chameleon. This is like I'm making a project from scratch. It generates the project structure, like a project template, not a web HTML template, right?

12:31 OKKEN: Right. I guess it should be called a Cookiecutter because it is based on Cookiecutter.

12:36 KENNEDY: More or less.

12:37 OKKEN: Yeah, so it's based on Cookiecutter, so you can use if you, I'm sure everybody's familiar with Cookiecutter. You use it to start a project and it pulls stuff off of GitHub and initializes your project, and then asks you a bunch of questions. But it has a whole bunch of really cool things that you might not actually think to do right away in a Django project. They're saying that it's more for larger projects, but I'm sure that a lot of these you could do 'em for smaller projects too. But it uses a system called Dependabot.

13:08 KENNEDY: That's an interesting one.

13:09 OKKEN: Which I haven't heard of before. But it's one of those systems to keep your dependencies up to date. It's got Poetry for package management, which is kind of neat.

13:16 KENNEDY: Mm-hmm.

13:18 OKKEN: Pytest for testing, of course, that's awesome. One of the reasons why I think this is neat because Django doesn't do pytest automatically, so having somebody initialize that and set it up for you is cool. And then, there's some of the other things. There are mypy for static typing, pre-commit-hooks already setup, Flake8, and an extension to the style guide already built in so you can use that as a template to use your own style guide. And a whole bunch of other cool things like Docker integration already, GitLab CI for building and testing. And then, something that I hadn't heard of before which is Caddy, which is, I going to probably get this wrong, but I think it's something to do with secure websockets or something, I don't know. Https, whatever that is.

14:06 KENNEDY: Sounds good. Yeah, I don't know what Caddy is either. I should check it out, but looks pretty cool. And, yeah, I think if you're creating Django projects, I think, or any form of web project, there's some for Flask, there's some for Pyramid, there's some for Django. I think looking at these more full featured, more structured, starter Cookiecutters are really valuable. And I think actually the biggest value comes to people using Flask. By the way. The reason I say that is, Django already has a structure. Right? There's a lot of structure like static files go here, whatever, a lot of stuff is set up when you create a site with Django. Same thing with Pyramid, you already use Cookiecutter templates. But Flask is like, well, you create a file, and then you're on your own. You know, so like with all that structure is like not anywhere to be seen, but it's still going to have to exist on real apps eventually. And so, having some projects that you can follow, I think it's really great.

14:54 OKKEN: Yeah, so that's a good segue. Not a segue into the next one, but I would love it if people would share with us some of their, some of their favorite Flask/Cookiecutter starter projects.

15:05 KENNEDY: Yeah, for sure. We can give them a little shout out in the extra section or something. So, you want to make it just a straight three of a kind for Django, in a row here? Let's just wrap it up with Django.

15:14 OKKEN: Sure, whatever.

15:15 KENNEDY: Yeah, we're already here. So, you've gone and you've created your project with one of these Django templates. You've got it working, you've done some testing. Maybe something wasn't working, so you flipped it into debug mode, that's cool. Maybe set some other stuff. And then, you know, all right, ready to push it out. It has like this template you told me about here. It already has integrated deploy steps into the CI build. So, that's pretty cool. You just type deploy and, boom, off you go. And then, a little bit later something starts happening to your AWS account or your database records that is not so good. That might be because you've left the debug mode, or some other setting on, that exposes all sorts of information. So, there are ways to run Django that is helpful for development, but then you obviously don't want to share that information with everyone else. So, there's this project called django-hunter, and it looks for insecure Djangos. So, if you deploy your app, you point this thing at it and ask it what's the status of this thing here. So, the person who wrote it said, "Why did we create this?" Well, it's a tool to help identify incorrectly configured Django apps that are exposing sensitive information. For example, in March 2018, there was 28,165, that's a weird way to write it. It says 28,165 thousand Django servers. Is that 28 million? I'm just going to say there's a lot of Django servers that are exposed on the internet showing off things like their AWS keys, their database passwords, and connection strings, et cetera. That you don't want.

16:47 OKKEN: Right.

16:48 KENNEDY: So, there's this cool tool called django-hunter and you can basically point it at your projects and it will tell you if something's going wrong with them.

16:55 OKKEN: That's cool. I love projects like this because Python's so easy to get started on things. You can, I guess, jump into the deep end before your quite ready and having tools like this to help you jump in safely, with the... It's good.

17:09 KENNEDY: Absolutely. So, you know, it's easy for people to say, "Well, that was sure stupid." You got hacked because you didn't, you know, set the debug mode to false. Well, if you're struggling to figure out like what does deployment mean, like I can't even barely get this thing to run on a Web server, and I'm trying to understand Linux, and databases, and firewalls. It's pretty easy to overlook that kind of stuff when you're struggling to just make the thing work. Right? There's a lot of these settings that you're like I just want to test this out and show it to somebody. I'm not, you know what, I haven't been running Django for ten years. So, there was actually a conversation, either on Twitter or on Reddit, about this. Somebody said, yeah, this is great. This guy jumped in and said, "Hey, I'm probably one of those 28,000 servers that are among my early projects that's out on Heroku. I accidentally exposed my AWS password and all hell broke loose." The problem is as a beginner, it's not obvious how to separate development and production settings and keep that stuff out of your public repo. The other thought was somebody said, you know, "There's a reasonable argument to be made that debug should be set to false by default." If you turn it on then maybe you know about it, so you know to turn it off. But if you never turn it on, how do you know? Right? There's a setting, there's a huge comment right by where the setting is that says never put this in production with debug=true. But it's like in a settings file you might not ever open.

18:30 OKKEN: Right.

18:31 KENNEDY: So, if you don't look at it, you know, that's bad. So, anyway, there's some interesting maybe thoughts around what to do to Django to make it better, but certainly having a tool to tell you if something is wrong is good.

18:42 OKKEN: Yeah, okay.

18:43 KENNEDY: All right, so django-hunters for those Django developers, or DevOps running Django, people running Django servers. That was good and that was all of our items. I want to first give a quick shout out to you, Brian, in our extras section. Thanks for having me on your show. And I blogged about that, so I put a link to the blog. But we had a great time talking about what it takes to be a good podcast guest and how to prepare for that, which is more broad than just go podcast guest, I guess.

19:09 OKKEN: Yeah, and I've actually already gotten a whole bunch of positive feedback on that episode, so I'm glad we did it.

19:15 KENNEDY: Yeah, great.

19:16 OKKEN: Anything else extra?

19:16 KENNEDY: You know, I was just thinking Christmas is coming up. Just a... You know, at least in the United States, there's this weird tradition, but it is a thing, where in shopping malls, a Santa will be hired, a Santa Claus, and will sit there and there's typically photographers around. And parents are bringing their children to the Santa and the purpose is the child sits on the lap of the Santa, asks for something probably totally unreasonable, and they take pictures of it. Of the whole situation, and hopefully the kid doesn't cry and get afraid of Santa. So, you have a good version of this, right?

19:51 OKKEN: Yeah, I'm going to read it out loud because it's a comic, but it's hilarious. And it was posted by Changelog on Twitter. This little girl is sittin' on Santa's lap and she says, "For Christmas I want a dragon." Of course, Santa says, "Be realistic." "Okay, I want enough donations to support my open-source work." With a response of, "What color do you want your dragon?"

20:16 KENNEDY: That's so awesome. I really love it. It's sad that it's true, but yeah, it's pretty awesome. Be realistic. What color you want your dragon to be? All right, well, I thought I'd throw one in here for you as well. It has nothing to do with any seasonal stuff, so this is good year round. Okay. Has more to do with race conditions, deadlocks, and that sort of weird timing problems you run into with multi-threading, yeah? So, you've heard the joke 'Why did the chicken cross the road', which has all sorts of weird answers but sometimes just to get to the other side, right? Why did the multi-threaded chicken cross the road?

20:47 OKKEN: I don't know, why?

20:47 KENNEDY: Road the side get to the other of the to. Ask me again.

20:51 OKKEN: Why did the multi-threaded chicken cross the road?

20:54 KENNEDY: The side of to the road other get. It's always scrambled, it's always different.

21:00 OKKEN: I love it.

21:01 KENNEDY: That concludes the joke section, I suppose.

21:03 OKKEN: Yeah, so yeah, we'd also, like we're both sort of silly people and would like to have some feedback as well from people to see whether or not we should keep a joke or two in the episodes or actually just whether or not we should. So, that'd be great.

21:19 KENNEDY: Yeah, sounds good. And if you have good jokes.

21:20 OKKEN: Yeah, send 'em.

21:21 KENNEDY: Send 'em. All right, well, Brian, thank you for doing this and everyone, thank you for listening.

21:25 OKKEN: Thank you.

21:26 KENNEDY: Mm-hmm, bye.

21:27 OKKEN: Thank you for listening to Python Bytes. Follow the show on Twitter via @Pythonbytes. That's Python Bytes as in b-y-t-e-s. And get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We're always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.

Back to show page