Brought to you by Michael and Brian - take a Talk Python course or get Brian's pytest book


Transcript #108: Spilled data? Call the PyJanitor

Return to episode page view on github
Recorded on Monday, Dec 10, 2018.

00:00 Hello and welcome to Python Bytes where we deliver Python news and headlines directly to your earbuds.

00:05 This is episode 108 recorded December 10th, 2018.

00:11 I'm Michael Kennedy.

00:12 And I'm Brian Okken.

00:12 And this episode is brought to you by DigitalOcean.

00:15 Check them out at pythonbytes.fm/digitalocean.

00:18 Tell you more about that later.

00:19 Right now, I wonder how you're doing, Brian?

00:20 I'm doing great.

00:21 Yeah?

00:22 Do you find that sometimes you end up with messy data?

00:26 So you kind of clean it up.

00:28 you're like, "Ah, gosh, not again." Yeah, you've got like, you know, empty spots and, you know, bad stuff.

00:34 Yeah, anyway, data.

00:36 Yeah, somebody spilled a string where a number was supposed to go.

00:38 Yeah.

00:39 Maybe there's something you can do.

00:40 Yeah, you can get a janitor.

00:41 A pyjanitor.

00:42 Clean up your janitor.

00:43 Yeah, first project we're going to talk about is pyjanitor.

00:47 It's a package for cleaning up your data.

00:50 And this has a history in, originally it was a port of an R package called janitor, but but now it's grown from that.

00:58 So it's both for cleanliness of data, but also just a really clean interface in convenient routines.

01:07 So it's kind of for anybody working with data that can have problems with it.

01:12 There's a whole bunch of stuff involved with this.

01:15 So some of the functionality includes cleaning up column names, which I'm not sure why you would have bad column names in the first place, but if you're pulling it from somewhere else.

01:25 - Yeah, like a lot of times people load CSVs into Panda data frames and things like that, I think.

01:29 - Yeah, okay, so cleaning those out.

01:31 Removing empty rows and columns and identifying duplicate entries.

01:35 There's some stuff that can just happen with data like that.

01:39 But it has a whole bunch of other cool things like telling your system how to deal with empty values and expanding columns and coalescing multiple columns into a single column and a whole bunch of stuff like that.

01:51 That's part of it is dealing with messy data.

01:54 But the other thing is to try to keep your code clean.

01:56 So on the other side of it, it has more of a functional programming style of using it.

02:02 And I'm not gonna try to really talk about this too much, but we have a code snippet in our show notes where it shows kind of how you would deal with data frames and doing things like dropping columns and stuff within pandas, and then how that would look in PyJanitor code.

02:17 And it just makes it a lot, I think it's more maintainable.

02:21 It's neat.

02:22 - Yeah, I really like this.

02:22 It looks super handy.

02:24 It's like a set of utilities on top of Pandas, which is great.

02:29 And I like how they describe this feature you just talked about.

02:31 It's like, it's a cleaner method changing verb-based API for common Pandas routines, otherwise known as a fluent interface.

02:40 If you're looking for one word there, but that's great.

02:44 So yeah, I think it's a really nice way to work with it.

02:46 It looks really approachable.

02:48 And I do like the fluent interface.

02:49 like I really, really wish more things like that operated that way.

02:53 - Yeah, for instance, a lot of the functions return data frames so that you can just keep on in a new function to do multiple stages of a workflow.

03:02 - Now this is all around pandas, so if you're doing not just regular data cleaning, but you gotta be working with pandas, but I think a lot of people who do that kind of work probably are, so it seems really helpful.

03:11 So would you consider yourself an expert at Python?

03:13 (laughing)

03:15 - No, I know enough to know what I don't know.

03:17 - Yeah, I think this is one of the things where it's like you're never really, like there's always stuff that you don't know, so you don't ever necessarily feel like an expert?

03:25 - Well, especially in this field, when we're like, I'm always researching stuff that know more than me about cool places, but all in all, I think I am, yeah.

03:33 - I would say that you are.

03:34 So I think there's a really interesting presentation done by James Powell, and this was recommended to us by one of our listeners, I just don't remember who, so I can't give him credit, but thank you for whoever sent this in.

03:44 And this is a presentation at PyData 2017 by James Powell.

03:51 And it's a YouTube video.

03:52 And it's not just a little light, here's my quick rundown of the five things.

03:57 It's like an hour and a half sort of deep dive into what it takes to be an expert.

04:02 So basically James says, hey look, it's pretty easy to be competent with Python.

04:08 You can learn a couple of things and whatever other programming language you use, you can kind of like make Python do that.

04:13 but to really understand it properly and take full advantage of it, write Pythonic code, things like that, is a whole lot harder.

04:21 So he runs through some of the things that he thinks people should know.

04:26 And it's really focused at the maybe advanced beginner, early intermediate type of developer who can do stuff with Python, but maybe stopped learning about the language and the features when they got whatever they're trying to make work, work.

04:40 Oh, good.

04:41 Yeah, so it covers things like the Python data model, otherwise known as the magic methods or nunder methods, meta classes, a bunch of other stuff.

04:49 And it's really nicely done.

04:51 I'm not a fan of presentations that are like, hey, here's seven slides and me talking about it.

04:55 Like, woohoo.

04:57 He just fires up an editor and says, I have no slides.

05:00 The editor is going to be the presentation.

05:02 Let's start talking about these things and just does it from scratch, which I think is a real genuine way to do it.

05:06 So well done.

05:07 - I'm gonna have to check that out myself.

05:08 - Yeah, yeah, I've watched some of it.

05:10 I haven't watched all of it.

05:11 I've only watched like maybe half.

05:13 But definitely watched enough to recommend it to folks.

05:15 If you feel like you're in the stage, like am I an expert?

05:18 I'm not sure.

05:18 Watch this and you'll probably get some things reinforced and others maybe be like yeah, I knew that, that's great.

05:25 I would even say it was awesome.

05:26 - That's very awesome.

05:28 There's a lot of awesomeness in Python and there's quite a few, on GitHub, there's quite a few different awesome lists.

05:37 And so that's what I wanna talk about today another awesome list. And this one is called the ISO. I saw these triangles. No, it's called awesome Python applications. It's kind of a just a way for you to try to highlight a bunch of different cool applications. Because if you're looking for packages to base your own project on, you can look at PyPI. But that's not as easy to do with applications because they don't exist in PyPI. So that's one of the why this has been created. There's quite a few categories already and Mahmoud Hashemi has started it and he wants people to help him out fill this in because it's kind of hard to find applications sometimes. So these are all applications written in Python that are open source that you can look at how they're doing things. Yeah I really like this because so often people say well I'd like to use Python for this project but to sell this to my teammates and my manager or my company it would be great to say well YouTube is written in Python, and I know you think Python doesn't scale, but I'm sure we're doing less than a million requests a second, so we'll probably be okay also.

06:45 Having the examples for those kinds of comparisons are really great.

06:48 This is a little bit like that.

06:49 There's a bunch of stuff for biology, cell profilers, and things like that, right?

06:55 >> Yeah.

06:56 Even, I had to look this one up, ERPs, Enterprise Resource Planning.

07:01 No, I don't need one of those, but cool, it's there.

07:04 But a lot of these, one of the things I like about this is there's a lot of custom applications that people end up writing, and they know that their problem space is very specific.

07:15 Instead of writing everything from scratch, you could take one of these open source projects and fork it or customize it for your own need.

07:23 And that's one of the benefits of open source, of course.

07:25 But good starting point.

07:26 Yeah, that's awesome.

07:27 I really like this.

07:28 Again, well done, Mahmoud.

07:31 And it's just, it's cool to have these examples out there, you know?

07:33 Yeah.

07:34 - I do, and if I had a cool application, I would like to put my application there in DigitalOcean.

07:40 Absolutely.

07:41 So DigitalOcean is sponsoring the show.

07:45 They've been sponsoring most episodes of Python Bytes, and they're big supporters of it.

07:49 So thank you to them.

07:51 We use them for some of our infrastructure, and it's working out great.

07:54 One thing I want to highlight this time around is their early access Kubernetes project.

07:59 So if you're doing anything with Docker and Kubernetes and things like that, they have some special tools for deploying and managing your containers in the cloud.

08:10 So just go over to pythonbytes.fm/digitalocean and you can sign up there or go over to the products and just pick Kubernetes and get started on that.

08:18 There's tons of other stuff that you can do as well, but the Kubernetes work they're doing is quite cool.

08:23 - Very cool, nice.

08:24 - Indeed.

08:25 So the next one I wanna talk about is something we haven't covered a ton on the show, but I think has some interesting shadows and parallels with Python itself, and that is some governance around Django itself.

08:39 So there's an article called Django Core No More by James Bennett.

08:44 So this is not core as in like some library, this is core as in core developers.

08:49 So Django's been around for a long time, 2005 onward I believe, and it's obviously a very polished professional web framework, one of the most, if not the most, it's in there fighting with Flask, for that title, but one of the most popular Python frameworks, lots of amazing apps are built on it.

09:08 But what they're finding is, say, actually, Django as a open source project is not recruiting enough active contributors.

09:17 That's surprising, right?

09:18 - Yeah, it is.

09:18 - They say one of the reasons they think this is not working so well is they feel like there's these people called Django core developers, and then there's everyone else.

09:29 And if you're not a core developer, well, you probably don't have any business messing around with Django or submitting any fixes or anything.

09:35 Maybe you'll tell a core developer and they can go do it, right?

09:38 But not do it themselves.

09:41 So the proposal in summary is more or less to abolish this concept of a core developer altogether.

09:48 So that when people come to look at Django, they don't go, "Oh, there's this special group of selected core developers and then everyone else." So instead, what they found was in practice, these core developers all had this straight commit bit.

10:02 They could just commit straight to the repo and just have stuff happen, but no one was doing that.

10:07 They were all creating pull requests and having a conversation around their changes anyway.

10:12 And that's how somebody would make a contribution to Django from the outside.

10:16 So they said, let's have a more democrat, not democratic, more spread out, even way of talking about people contribute to Django so that people are more likely to come and make contributions.

10:30 >> Okay. So they'll still have some sort of process for deciding on which pull requests to do, right?

10:36 >> Yeah. So now they're going to have two different groups of people who are formalizing the stuff.

10:41 There's mergers and releasers who would respectively merge PRs and then package it up and release it.

10:49 So these are more like bureaucratic roles like finalizing it.

10:54 But the idea is to have PRs and open discussions around issues and PRs.

10:57 And then these folks kind of say, "Yeah, okay, we're all good with this." Interesting.

11:01 Yeah.

11:01 Yeah.

11:01 So it's a little bit of a parallel of Guido stepping back and saying, "Okay, everybody, you guys got to spread out some of this decision making and not just leave it all on my back." Yeah, I like what they're doing.

11:12 I also like doing a lot of this stuff in the open and having the governance models be sort of an open discussion so that different groups can learn from it.

11:20 So, like for instance, I was listening to your interview about SANIC, and they were talking about basically still figuring out how to govern the SANIC project.

11:33 And so, doing all this in the open and having everybody be able to give feedback and stuff is cool.

11:38 Yeah, it's definitely cool.

11:39 Now, I don't believe this is the way that things are.

11:42 This is a proposal for the way that James and folks wants this to be.

11:47 So, kind of take it in that sense, right?

11:48 It is not an official decision as far as I know, but this is the proposal.

11:52 - Okay, neat.

11:52 - Yeah, cool.

11:53 Speaking of Django, what'd you got?

11:55 What's next?

11:56 - Yeah, I wanted to shoo this in.

11:58 I think somebody mentioned this on Twitter.

12:01 Again, I can't remember who, sorry, but thanks for everybody for giving us tips on things.

12:06 There is a Django template that is called the WeMake Django template.

12:11 I think, actually, I don't really know who WeMake is.

12:14 I think WeMake is a group that does like customer websites and stuff.

12:18 - Let me just say really quick, this is not templates as in Django templates or Jinja 2 or Chameleon.

12:23 This is like I'm making a project from scratch.

12:27 Like it generates the project structure, like a project template, not a web HTML template, right?

12:31 - Right. - Okay.

12:32 - I guess it should be called a cookie cutter because it is based on cookie cutter.

12:36 - More or less.

12:37 - Yeah, so it's based on cookie cutter.

12:38 So you can use, if you, I'm sure everybody's familiar with cookie cutter.

12:42 You use it to start a project and it pulls stuff off of GitHub and initializes your project and then asks you a bunch of questions.

12:50 But it has a whole bunch of really cool things that you might not actually think to do right away in a Django project.

12:57 They're saying that it's more for larger projects, but I'm sure that a lot of these are, you could do them for smaller projects too, but it uses a system called Dependabot, which I hadn't heard of before.

13:10 But it's one of those systems to keep your dependencies up to date.

13:14 It's got poetry for package management, which is kind of neat.

13:18 pytest for testing, of course, that's awesome.

13:20 One of the reasons why I think this is neat because Django doesn't do pytest automatically, so having somebody initialize that and set it up for you is cool.

13:29 Including things, and then there's some of the other things are mypy for static typing, pre-commit hooks already set up, Flake 8 and an extension to the style guide already built in, so you can use that as a template to use your own style guide.

13:43 and a whole bunch of other cool things like Docker integration already, GitLab CI for building and testing.

13:50 And then something that I hadn't heard of before, which is CADI, which is, I'm gonna probably get this wrong, but I think it's something to do with secure web sockets or something, I don't know.

14:01 HTTPS, whatever that is.

14:04 (laughing)

14:06 - Sounds good.

14:07 Yeah, I don't know what CADI is either.

14:08 I should check it out, but it looks pretty cool.

14:11 And yeah, I think if you're creating Django projects, I think, or any form of web project, there's some for Flask, there's some for Pyramid, there's some for Django.

14:19 I think looking at these more full featured, more structured starter cookie cutters are really valuable.

14:25 And I think actually the biggest value comes to people using Flask, by the way.

14:30 The reason I say that is Django already has a structure.

14:33 There's a lot of structure, like static files go here, whatever.

14:35 A lot of stuff is set up when you create a site with Django.

14:38 Same thing with Pyramid, you already use cookie cutter templates, but Flask is like, well, you create a file and then you're on your own.

14:45 You know, so like all that structure is like not anywhere to be seen, but it's still going to have to exist on real apps eventually.

14:51 And so having some projects that you can follow, I think it's really great.

14:54 - Yeah, so that's a good segue, not a segue into the next one, but I would love it if people would share with us some of their favorite Flask cookie cutter starter projects.

15:05 - Yeah, for sure.

15:06 We can give them a little shout out in the extra section or something.

15:09 So you want to make it just a straight three of a kind for Django in a row here?

15:13 Let's just wrap it up with Django.

15:14 Yeah, we're already here.

15:15 So you've gone and you've created your project with one of these Django templates.

15:21 You've got it working, you've done some testing, maybe something wasn't working, so you flipped it into debug mode, that's cool.

15:27 Maybe set some other stuff.

15:29 And then you're like, all right, ready to push it out.

15:31 It has this template you told me about here.

15:34 It already has integrated deploy steps into the CI build.

15:39 So that's pretty cool, you just type deploy and boom, off you go.

15:42 And then a little bit later, something starts happening to your AWS account or your database records that is not so good.

15:50 That might be because you left the debug mode or some other setting on that exposes all sorts of information.

15:58 So there are ways to run Django that is helpful for development, but then you obviously don't want to share that information with everyone else.

16:06 So there's this project called Django Hunter, and it looks for insecure djangos.

16:11 So if you deploy your app, you can point this thing at it and ask it, what's the status of this thing here?

16:17 So the person who wrote it said, why did we create this?

16:20 Well, it's a tool to help identify incorrectly configured Django apps that are exposing sensitive information.

16:26 For example, in March 2018, there was 28,165, it's a weird way to write it, it says 28,165,000 Django servers.

16:35 Is that 28 million?

16:37 I'm just going to say there's a lot of Django servers that are exposed on the internet, showing off things like their AWS keys, their database passwords and connection strings, et cetera.

16:46 That you don't want.

16:47 So there's this cool tool called Django Hunter, and you can basically point it at your projects, and it will tell you if something's going wrong with them.

16:55 - That's cool.

16:56 I love projects like this, because Python's so easy to get started on things.

17:01 You can, I guess, jump into the deep end before you're quite ready and having tools like this to help you jump in safely with it.

17:07 - Yeah, absolutely.

17:09 So, you know, it's easy for people to say, well, that was sure stupid, you got hacked because you didn't, you know, set the debug mode to false.

17:16 Well, if you're struggling to figure out, like, what does deployment mean?

17:20 Like, I can't even barely get this thing to run on a web server, and I'm trying to understand Linux and databases and firewalls.

17:27 Like, it's pretty easy to overlook that kind of stuff when you're like struggling to just make the thing work.

17:34 There's a lot of these settings that like, you're like, I just wanna test this out and show it to somebody.

17:38 I'm not, I haven't been running Django for 10 years.

17:41 So there was actually a conversation either on Twitter or on Reddit about this.

17:46 And somebody said, yeah, this is great.

17:48 Like this guy jumped in and says, hey, I probably one of those 20,000 servers that are among my early projects that's on Heroku.

17:57 I accidentally exposed my AWS password and all hell broke loose.

18:01 The problem is as a beginner, it's not obvious how to separate development and production settings and keep that stuff out of your public repo, of course.

18:09 Yeah, the other thought was somebody said, there's a reasonable argument to be made that debug should be set to false by default.

18:17 If you turn it on, then maybe you know about it so you know to turn it off.

18:20 But if you never turn it on, how do you know?

18:22 There's a setting, there's a huge comment right by where the setting is.

18:25 It says, never put this in production with debug equals true.

18:28 but it's like in a settings file you might not ever open.

18:30 So if you don't look at it, that's bad.

18:32 So anyway, there's some interesting maybe thoughts around what to do to Django to make it better, but certainly having a tool to tell you if something is wrong is good.

18:43 All right, so Django Hunters for those Django developers, or DevOps running Django, people running Django servers.

18:50 (laughing)

18:52 That was good, and that was all of our items.

18:54 I wanna first give a quick shout out to you, Brian, in our extra section.

18:58 Thanks for having me on your show.

18:59 And I blogged about that, so I put a link to the blog.

19:02 But we had a great time talking about what it takes to be a good podcast guest and how to prepare for that, which is more broad than just podcast guests, I guess.

19:09 - Yeah, and we've actually already gotten a whole bunch of positive feedback on that episode, so I'm glad we did it.

19:15 - Yeah, great.

19:16 - Anything else extra?

19:17 - You know, I was just thinking Christmas is coming up.

19:20 Just, you know, at least in the United States, there's this weird tradition, but it is a thing.

19:27 where in shopping malls, a Santa will be hired, a Santa Claus, and will sit there, and there's typically photographers around, and parents will bring their children to the Santa, and the purpose is the child sits on the lap of the Santa, asks for something, probably totally unreasonable, and they take pictures of it, of the whole situation, and hopefully the kid doesn't cry and get afraid of Santa.

19:49 So, you have a good version of this, right?

19:51 - Yeah, I'm gonna read it out loud because it's a comic, But it's hilarious.

19:57 And it was posted by ChangeLog on Twitter.

20:01 This little girl is sitting on Santa's lap, and she says, "For Christmas, I want a dragon." Of course, Santa says, "Be realistic." Okay, I want enough donations to support my open source work with a response of, "What color do you want your dragon?" (laughing)

20:16 - That's so awesome.

20:17 I really love it.

20:18 It's sad that it's true, but yeah, it's pretty awesome.

20:21 Be realistic.

20:22 What color do you want your dragon to be?

20:23 - All right, well, I thought I'd throw one in here for you as well.

20:27 It has nothing to do with any seasonal stuff, so this is good year round.

20:31 Has more to do with race conditions, deadlocks, and that sort of weird timing problems you run into with multithreading, yeah?

20:38 So, you've heard the joke, why did the chicken cross the road, which has all sorts of weird answers, but sometimes just to get to the other side, right?

20:44 Why did the multithreaded chicken cross the road?

20:47 - I don't know, why?

20:48 - Road the side get to the other of the two.

20:51 Ask me again.

20:51 - So why did the multi-throated chicken cross the road?

20:54 - The side of to the road other get.

20:56 (laughing)

20:57 There's always ground love, it's always different.

21:00 - I love it.

21:01 - That concludes the joke section, I suppose.

21:03 - Yeah, so yeah, we'd also like, we're both sort of silly people, and we'd like to have some feedback as well from people to see whether or not we should keep a joke or two in the episodes, or actually just whether or not we should.

21:18 So that'd be great.

21:19 - Yeah, it sounds good, and if you have good jokes.

21:20 Yeah, send them.

21:21 Send them.

21:22 All right.

21:23 Well, Brian, thank you for doing this and everyone, thank you for listening.

21:25 Thank you.

21:26 Bye.

21:27 Thank you for listening to Python Bytes.

21:28 Follow the show on Twitter via @pythonbytes.

21:31 That's Python Bytes as in B-Y-T-E-S.

21:34 And get the full show notes at pythonbytes.fm.

21:37 If you have a news item you want featured, just visit pythonbytes.fm and send it our way.

21:42 We're always on the lookout for sharing something cool.

21:44 On behalf of myself and Brian Okken, this is Michael Kennedy.

21:48 Thank you for listening and sharing this podcast with your friends and colleagues.

Back to show page