Transcript #25: Could we have more in-database machine learning please?
00:00 Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds.
00:05 This is episode 25, recorded May 10th, 2017.
00:11 I'm Michael Kennedy.
00:12 And I'm Brian Okken.
00:13 And we've gathered up a bunch of cool Python things to share with you this week.
00:16 So Brian, I want to start with some news coming out of Microsoft's biggest developer conference this year.
00:23 There's some actually Python news, which is kind of cool.
00:26 You wouldn't expect that, right?
00:27 Right.
00:27 Yeah, they actually did a whole section on machine learning and AI.
00:31 And it's a very cool thing.
00:32 But what I want to talk about is one of the biggest databases in the corporate space, the most popular ones, is Microsoft SQL Server.
00:40 So the thing that I want to point out or talk about is they've just announced a very interesting feature.
00:48 And I'm kind of hoping other database providers copy this like straight away.
00:52 So what they've announced is in-database machine learning.
00:56 Wow.
00:57 So, yeah, it's crazy, right?
00:58 Like, wait a minute.
00:59 What does in-database machine learning even mean?
01:01 Yeah, exactly.
01:02 So here's the idea.
01:03 Like, if you're going to transfer a lot of data, machine learning or otherwise, and you've got one server over here with your data and another server that's executing it, then you've got the network latency.
01:16 You've got the crossing process boundaries.
01:17 You've got all sorts of latency working there.
01:20 So especially if you have a chatty API, this can be problematic.
01:25 But what this new feature is, they have now built the ability to run CPython 3.5 in process in SQL Server.
01:33 And you can install external packages.
01:36 It comes built in with some of the machine learning packages already there.
01:40 It runs a subset of the Anaconda distribution included right there.
01:45 So inside your database, you can basically install Python scripts and do full-on machine learning with zero latency to your data.
01:53 That's pretty cool.
01:54 I think it's really cool.
01:55 You might have to go back to teaching people about Microsoft products.
01:59 Yeah, I'm not so sure.
02:00 I'm not so sure I'm going that far.
02:02 In fact, what I would really like to see is other databases, other database providers take this on and go, this is a cool idea.
02:11 Can we put this in other places?
02:13 Like on MySQL, on MongoDB, I think it would be super cool to see it there.
02:18 I mean, you kind of have that with SQLite in reverse and like your database runs in your machine learning process rather than your machine learning runs in your database process.
02:25 But if you're already, for some reason, using SQL Server and you want to do machine learning, check this out.
02:31 This is a pretty cool feature.
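The SQLite comparison above can be made concrete with a small sketch. This is not the SQL Server feature itself; it only illustrates the reverse arrangement, where the database runs inside the Python process and the data never crosses a network. The table name, columns, and values are made up for illustration.

```python
import sqlite3
from statistics import mean

# An in-memory SQLite database lives inside this Python process,
# so reading rows involves no network hop and no second server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", 20.0), ("ann", 30.0), ("bob", 100.0)],  # illustrative rows
)

# Pull the data straight into Python and compute on it in-process.
amounts = [row[0] for row in conn.execute("SELECT amount FROM purchases")]
average_amount = mean(amounts)
print(average_amount)
conn.close()
```

The in-database approach described in the episode flips this around: the Python code moves to where the data already lives.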
02:32 Yeah, that's pretty neat.
02:33 All right.
02:34 Awesome.
02:34 Okay.
02:35 That's really neat.
02:36 I want to talk about some real fake stuff and actually a tool called Faker.
02:41 So there's an article to introduce.
02:43 Faker has been around for a while, but there's a new article on the Semaphore blog called Generating Fake Data for Python Unit Tests with Faker.
02:52 I had heard of it and I hadn't played with it before.
02:55 So the article is pretty neat, but I played around with it just this afternoon.
02:59 And what Faker is, is a way to, you know, basically generate data for you, just random stuff, but in the right format.
03:07 And the list of stuff that Faker can generate definitely includes the lorem ipsum type things, just some random text.
03:17 But you can also do credit card numbers and phone numbers and URLs and a lot of stuff that you would want to do to be able to fill out a set of data to make it look real and without, to test a system.
03:32 That's cool.
03:33 Without having to.
03:34 Yeah, I see two major uses for this.
03:36 And I agree.
03:37 Like Faker is awesome.
03:38 It's no joke.
03:39 So basically you install Faker and you can go to it and say, give me some words.
03:45 Give me a name.
03:46 And if you say Faker dot name, it'll be like Joshua Wheeler.
03:50 Give me a month for, give me a sentence.
03:52 And I'll give you a sentence.
03:52 Give me a state, Michigan.
03:54 Give me a random number.
03:55 Like you can ask for all these different things.
03:57 One of the really good uses for this is if you're doing web development and you don't have any data yet, it is super hard to even write the code to process the sequences, but also very hard to do the design of like, well, how is this supposed to look?
04:10 And having real-ish data makes that process so much easier.
04:15 And it's really easy to go, give me a month.
04:17 Give me a year.
04:18 Give me a state.
04:19 Things like that.
04:20 And generate fake data with this.
04:22 The other one obviously is with testing, right?
04:24 Like instead of having like all the trouble of coming up with these things for the fake pieces of data you're going to pass and you don't necessarily want to hard code it.
04:32 Maybe that's going to put some dependency on that hard coded value in your test.
04:35 Like just run Faker across your objects and fill them up.
04:38 It has some in it, some things that you don't really think about.
04:41 Like I ran the phone number a few times and it listed phone numbers with extensions, phone numbers with dashes, phone numbers without, so phone numbers with parentheses and stuff that you probably should deal with.
04:54 But might not come up with on your own.
04:56 And then I was looking through and one of the neat things is it has Python structures too.
05:01 Under the py section, you can generate a Python dictionary, or basically get a dictionary or a tuple or set.
05:11 And it just comes up with random tuples and random dictionaries.
05:14 It's pretty cool.
05:15 Oh, wow.
05:15 How cool.
05:16 I didn't even know about the py section.
05:17 You can also switch it to multilingual.
05:20 So US English, Japanese, Italian, Russian.
05:22 And so on.
05:24 Yeah.
05:24 So if you were like doing localization, like what would it be like if I got a Russian name in here?
05:29 Would my system still work?
05:31 Like, well, try it.
05:32 So that's pretty cool.
05:34 I like it.
05:35 Yeah.
05:35 Yeah.
05:35 If you need fake data, check out Faker.
05:37 Seems funny to say, but you know.
05:40 Yeah.
05:41 Indeed it does.
05:42 So I, Brian, I totally skipped over your first one with Stack Overflow Trends.
05:47 That's pretty exciting.
05:47 Oh, yeah.
05:48 So let's go ahead and talk about it.
05:50 So Stack Overflow Trends, Stack Overflow came out with a tool called Stack Overflow Trends.
05:57 And the article that they have to introduce it, the first example that they show is Python
06:05 overtaking PHP for questions asked per programming language.
06:11 Of course, they only compare to PHP, Perl, and Python.
06:15 And Perl, apparently nobody asks questions about Perl.
06:19 Yeah, Perl's not a growing area of study.
06:22 I think, you know, the closest analogy to this would be like, what does Google Trends do?
06:26 This is like, you know, that does that for searches.
06:28 This is like the same type of tool, but for Stack Overflow popularity.
06:33 Yeah.
06:33 I think it's neat to look at like what kind of questions people are asking and how that grows.
06:40 And there was definitely a steep, so there was, Python was fairly around, fairly flat from like 2008 through 2012.
06:51 Yep.
06:52 And then a sharp curve up just starts taking off.
06:56 So, yeah.
06:57 It's really, it's really great.
06:58 Yeah.
06:58 It's just like somebody flipped a switch in 2012 and like, you know, the Python is growing.
07:04 It's awesome.
07:04 Yeah.
07:06 So if you want to study things, definitely this is a place to go do it.
07:09 You know, maybe you're looking like, what should we base our next project on?
07:12 What are the future trends in programming technologies?
07:15 This is a good tool for that.
07:16 And it's great to see that highlighting Python's growth and popularity.
07:20 Yeah.
07:21 So normally we would have like a sponsor spot right now.
07:25 Yeah.
07:25 But there's like this quiet period, right?
07:27 Yeah.
07:28 Yeah.
07:28 So no sponsor this week, you guys.
07:29 You know, we have upcoming sponsors.
07:31 They kind of plan stuff out in sort of sparsely.
07:34 But if you're out there and you're like, hey, my company wants to get the word out to Python developers, send us a message.
07:39 Just go to the contact page on pythonbytes.fm and we'd love to talk to you.
07:43 All right.
07:44 I get questions all the time from people who are learning to code.
07:47 And one of the guys on Twitter, Alan Jones, sent us a message about a pretty cool Medium article that really is very data-driven about people learning to code.
07:58 So this article is called, We Asked 20,000 People Who They Are and How They're Learning to Code.
08:05 So that's a lot of people.
08:07 Yeah.
08:07 Now, they said, all right.
08:09 They probably did Skype or something because that would be a really big phone bill.
08:12 Yeah.
08:13 I'm going to mail you a letter.
08:14 No.
08:15 So they said, all right, who participated?
08:17 Well, there's 20,000 people who did this survey.
08:19 And most of them have been coding for less than five years.
08:23 62% live outside the U.S.
08:26 This is interesting.
08:27 Their average age of people learning to code is 28.
08:30 So I get messages all the time like, hey, I'm 30.
08:32 There's no way I can learn to code.
08:34 Like, you're with these 20,000 other people, right?
08:36 It's not that uncommon.
08:37 That's actually the average age.
08:39 And if you're over it, right, you know, it's still a lot of, definitely an age range.
08:43 And they have an interesting, there's many interesting pictures in this article and graphs.
08:46 It's a data analysis type thing.
08:49 And they've got average age to learn to code by country.
08:54 So you look at like France and the U.K.
08:58 And those guys are in the 30s on average.
09:01 You look at India and they're in their teens on average, which is, I don't know what that means, but that's interesting.
09:07 Another interesting stat that I thought we could pull out is 19% are women.
09:12 While obviously that is super low compared to where it should be, right?
09:16 That should be 50%.
09:17 But still 19% is, I guess it's higher than I expected.
09:21 And it kind of made me happy because I feel like it's a positive trend, even if it's not where it should be.
09:26 Yeah.
09:26 The average person learning to code has been coding for 21 months and 25% of them already have their first job.
09:32 So there's a bunch of cool stats like this that you can go and pull out.
09:37 So check out that article.
09:39 We asked 20,000 people who they are and why they're, or how they're learning to code.
09:43 And almost 60% wanted, 59% wanted to become full stack web developers.
09:48 Yeah.
09:48 It's interesting, right?
09:49 Like the web definitely factors heavy with data science being number two.
09:53 So you can imagine, this is not a Python only study, right?
09:56 These are just people learning to code, but you can imagine Python is playing a heavy role in those two areas.
10:01 They also have a podcast section, which is kind of cool.
10:04 What do you mean?
10:04 They have a section on what podcasts people who are learning to code listen to.
10:08 Okay.
10:09 Are we on there?
10:10 Talk Python is.
10:11 Talk Python is.
10:11 But I didn't find Python Bytes, unfortunately.
10:14 But that's because we're still letting them know.
10:16 Yeah.
10:18 Well, I'm glad that Talk Python is on there.
10:20 That's pretty cool.
10:21 It is pretty cool.
10:22 Thank you.
10:22 Would you say that it's an anomaly that Python Bytes wasn't on there?
10:27 I think it's just because we're new.
10:28 We don't really teach people how to code, though.
10:30 No, no, no.
10:31 I think this is...
10:32 Oh, you were trying to do a transition.
10:33 Oh, that's so cool.
10:34 I was trying to.
10:35 So our next item is about anomaly detection.
10:39 Yeah, anomaly detection.
10:40 You have to forgive me.
10:42 It's almost midnight here in Munich.
10:43 That's right.
10:44 You're still on your German tour.
10:45 Yeah.
10:46 Two more days.
10:47 There was a really great article, and I should have written the person's name down, called Introduction to Anomaly Detection.
10:54 And it's kind of a link to Emanuel Ruf.
10:58 Emanuel, I can pronounce that part.
11:01 It uses Python for an interesting piece of data analysis: anomaly detection.
11:11 Basically, looking at a whole bunch of data from something and finding the ones that you don't really know what the trend is going to be, but the ones that don't fit whatever the trend is for everything else.
11:25 And it's actually just a fascinating couple of pages on here.
11:30 And there's code samples.
11:32 I'm not doing it justice talking about it, but it's definitely a well-thought-out, well-studied article from datascience.com.
11:42 Yeah, they have a couple of areas that they focus on.
11:45 They've got the types of categories of anomalies, like the ones you might think of, which they call point anomalies.
11:51 So, detecting credit card fraud based on amounts spent.
11:55 Like, I live in the U.S.
11:57 Somebody tried to buy $1,000 worth of lumber in Mexico with my card.
12:00 No, that's probably not okay.
12:02 Real story.
12:04 So, then they have contextual anomalies.
12:08 So, they say, like, sometimes these things make sense, but only within a context.
12:14 So, for example, spending $100 on food every day is totally reasonable on a vacation, but it's odd if you're not on vacation.
12:20 So, can you determine are they on vacation, right?
12:22 Or collectively, like, copying, like, tons of data off network servers might look like you're trying to steal data if it knows that you're doing this all over the place.
12:32 But copying one big file would mean nothing, right?
12:34 Yeah.
12:34 Yeah.
12:35 So, they basically break it down by those two categories.
12:37 It's pretty interesting, all the machine learning-based approaches and stuff.
12:40 Yeah, and the math behind it, like the moving averages and the K-nearest neighbor and K-means algorithms.
12:48 Oh, nice.
12:48 Things like that.
12:49 Yeah, absolutely.
12:49 Very cool.
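A moving-average check for point anomalies, the simplest of the techniques mentioned above, can be sketched with the standard library alone. This is not code from the article; the function name, the window size, the threshold of three standard deviations, and the spending figures are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_point_anomalies(values, window=5, threshold=3.0):
    """Return (index, value) pairs that sit far from the trailing moving average."""
    anomalies = []
    for i in range(window, len(values)):
        recent = values[i - window:i]      # the `window` values just before i
        average = mean(recent)
        spread = stdev(recent)
        deviation = abs(values[i] - average)
        # With zero spread, any departure from the average counts as anomalous.
        if (spread == 0 and deviation > 0) or (
            spread > 0 and deviation > threshold * spread
        ):
            anomalies.append((i, values[i]))
    return anomalies


# Illustrative daily card spend with one suspicious purchase.
daily_spend = [20, 22, 19, 21, 23, 20, 1000, 22, 21]
print(find_point_anomalies(daily_spend))  # → [(6, 1000)]
```

Note that the days right after the spike are not flagged: the spike inflates the trailing spread, which is one reason real systems tend to use more robust statistics than this sketch does.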
12:50 I think I'm going to use the K-nearest neighbor just in random conversation tomorrow just to make me sound smarter.
12:56 Where should we go to eat?
12:57 I don't know, but we're going to have to apply the K-nearest neighbor to these restaurant choices and figure it out.
13:02 Yes, definitely.
13:03 So, I want to close this out with a message from the BeeWare guys.
13:09 So, BeeWare is a cool project that really does a bunch of fairly unique things.
13:15 So, it supports running Python apps on things like iOS and Android, macOS apps that are native .app files in Python, two alternate Python implementations, some cross-platform widgets, and a couple of other things.
13:33 So, it's done by Russell Keith-Magee, and it's been going on for about four years, so really great.
13:38 And he posted a thing that said, a request for your help.
13:41 So, basically, he's been working for a company that's largely funded the development or the furthering of these projects, right?
13:50 So, they've got extensive improvements for this cross-code compiler, an Android backend, a Django backend for these Toga apps that can be run as web apps or local.
14:01 A Windows Forms .NET UI for Toga.
14:04 So, you can have a Windows app that has a modern, natural appearance on Windows.
14:08 All sorts of cool stuff, right?
14:09 So, a cool project.
14:11 And so, obviously, with the request for help, you know, what is up, right?
14:16 Well, his contract ended, so now he's like, I don't have all this time and energy I can put on here.
14:22 I've got to go back to work.
14:23 And the reason I'm bringing it up is we've got a lot of projects that are looking at different funding models to allow people to work on it, right?
14:31 There's the pretty standard.
14:33 I'll create a project and then try to sell consulting on top of the project.
14:37 There's more interesting, like, platform-as-a-service type things that people are doing.
14:41 So, the Redash guys that I talked about last week on Talk Python have, like, hosted versions of their open source thing.
14:49 We've got the Scrapinghub guys with Scrapy doing their infrastructure-as-a-service or platform.
14:56 Well, web scraping-as-a-service, right?
14:57 All these are very interesting.
14:58 So, basically, Russell says, hey, could you sponsor my project?
15:02 And, you know, one, check out his page.
15:06 You can become a member and, like, give him $10 and help keep this moving because he's doing a bunch of cool stuff.
15:10 But also, like, if you have a project and it needs funding, think about what he's up to.
15:15 Does it make sense for your project?
15:16 Things like that.
15:17 Yeah.
15:17 I think it's a neat idea.
15:19 I do, too.
15:20 So, I definitely think the BeeWare project has huge possibilities for where it could help people.
15:27 And, certainly, if you just want to work on an open source project, people ask me all the time, like, hey, can you recommend a project I could work on?
15:33 Because I just want to get started on something and I don't really know enough to pick something myself.
15:38 I think BeeWare is a really good one.
15:40 They have a very welcoming, explicit way of onboarding people who are new to open source.
15:45 So, that's also a way to help them out.
15:46 Definitely.
15:47 All right.
15:47 Well, good luck to all of them.
15:48 Yeah, good luck.
15:49 That would be cool to see that keep growing because it's doing cool stuff over there in that project.
15:53 All right.
15:55 Well, I have one shout-out for us out of my own personal news, Brian.
15:59 Okay.
16:00 So, there's a brand new PyCon, not major main PyCon, but a regional PyCon, and it's in a pretty sweet place.
16:07 So, I'm starting to think I might have to attend this.
16:10 So, if you check out PyCascades.com, it's in Vancouver, BC in January this year, the next January.
16:19 So, if you want to go up to the Pacific Northwest, one of the more beautiful cities around here, they have things like a PyCon hike as well as all the talks and stuff.
16:29 You know, you can check it out at PyCascades.com.
16:31 That sounds like a lot of fun.
16:33 Yeah.
16:33 Actually, you know, being in the Northwest, you'd think I'd have been to Vancouver.
16:37 I haven't ever been there.
16:38 Here's your chance.
16:38 So, I might have to go up there.
16:40 Yeah, we might have to just jump on the train and go up there.
16:42 Yeah.
16:42 Sounds good.
16:43 Well, I've been, the book is very close.
16:47 So, I've been working late evenings here, getting it ready, working with my editor.
16:52 That right now, it's supposed to, the beta is supposed to be available right before PyCon.
16:58 That is awesome.
16:59 Or I guess technically at the beginning, it's on the 17th.
17:03 So, next Wednesday.
17:04 Yeah.
17:04 So, come check us out at our booth.
17:07 Meet Brian.
17:07 Talk to him about his book.
17:08 And you should be ready by then.
17:11 You'll be looking probably tired.
17:12 Yeah.
17:13 And my family's going to be a little irritated with me.
17:15 I haven't slept for three days.
17:17 So, yeah.
17:18 Perhaps.
17:18 All right.
17:20 Well, Brian, thanks for chatting with me and sharing all this news with everyone.
17:24 Yeah.
17:25 Thank you.
17:25 You bet.
17:26 Bye.
17:26 Bye.
17:26 Thank you for listening to Python Bytes.
17:30 Follow the show on Twitter via @pythonbytes.
17:32 That's Python Bytes as in B-Y-T-E-S.
17:35 And get the full show notes at pythonbytes.fm.
17:39 If you have a news item you want featured, just visit pythonbytes.fm and send it our way.
17:43 We're always on the lookout for sharing something cool.
17:46 On behalf of myself and Brian Okken, this is Michael Kennedy.
17:49 Thank you for listening and sharing this podcast with your friends and colleagues.