Transcript for Episode #25:
Could we have more in-database machine learning please?
Michael KENNEDY: Hello and welcome to Python Bytes, where we deliver Python news and headlines directly to your earbuds. This is Episode #25, recorded May 10th, 2017. I'm Michael Kennedy.
OKKEN: And I'm Brian Okken.
KENNEDY: We gathered up a bunch of cool Python things to share with you this week. So, Brian, I want to start with some news coming out of Microsoft's biggest developer conference this year. There's actually some Python news, which is kind of cool. You wouldn't expect that, right?
KENNEDY: Yeah, they actually did a whole section on machine learning and A.I. (artificial intelligence) and some very cool things. But what I want to talk about is, what is one of the biggest databases in the corporate space, one of the most popular ones? It is Microsoft's SQL Server. The thing that I would point out or talk about is they've just announced a very interesting feature and I'm kind of hoping other database providers copy this like, straight away. What they've announced is in-database machine learning.
KENNEDY: Yeah, it was crazy, right? Like, 'Wait a minute. What does in-database machine learning even mean?' So, here's the idea. If you are going to transfer a lot of data, machine learning or otherwise, and you've got one server over here with your data and another server that's executing it, then you've got the network latency, you've got the crossing of process boundaries; you've got all sorts of latency at work in there. Especially if you have a chatty API (application programming interface), this can be problematic. But what this new feature is, they have now built the ability to run CPython 3.5, in process, in SQL Server. You can install external packages, and it comes with some of the machine learning packages already built in. It runs a subset of the Anaconda Distribution included right there. So, inside your database you can basically install Python scripts and do full-on machine learning with, like, zero latency to your data.
OKKEN: That's pretty cool.
KENNEDY: I think it's really cool.
OKKEN: You might have to go back to teaching people about Microsoft products.
KENNEDY: Yeah, not so sure I'm going that far. (Laughs) In fact, what I would really like to see is other databases, other database providers take this on and go, 'This is a cool idea. Can we put this in other places?' Like on MySQL, on MongoDB, I think it would be super cool to see it there. You kind of have that with SQLite in reverse, where your database runs in your machine learning process, rather than your machine learning running in your database process. But if you're already, for some reason, using SQL Server for machine learning, check this out. This is a pretty cool feature.
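The "SQLite in reverse" point can be sketched with Python's built-in `sqlite3` module: the database engine runs inside the Python process, so the analysis code reads rows without a network hop. This is only an illustration of in-process data access, not SQL Server's feature; the `readings` table, its values, and the `mean_by_sensor` helper are made up for the example.

```python
import sqlite3
import statistics

# In-memory database: the engine lives inside this Python process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)],
)

def mean_by_sensor(connection):
    """Group rows by sensor and average them in Python, with no network round trip."""
    groups = {}
    for sensor, value in connection.execute("SELECT sensor, value FROM readings"):
        groups.setdefault(sensor, []).append(value)
    return {sensor: statistics.mean(values) for sensor, values in groups.items()}

means = mean_by_sensor(conn)
print(means)
```

The trade-off is the mirror image of the in-database approach: here the data comes to the code's process, rather than the code going to the data's process.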
OKKEN: Yeah, that's pretty neat.
KENNEDY: Alright. Awesome.
OKKEN: That's really neat. I want to talk about some real fake stuff, actually, a tool called Faker. So, Faker's been around for a while but there's a new article on the Semaphore blog called, "Generating Fake Data for Python Unit Tests with Faker." I had heard of it and I hadn't played with it before, so the article's pretty neat. I played around with it just this afternoon, and what Faker is, is a way to basically generate data for you, just random stuff but in the right format. As for the list of stuff that Faker can generate, it definitely can do, like, the lorem ipsum type things and just some random text. But you can also do credit card numbers, and phone numbers, and URLs. A lot of stuff that you want to do to be able to fill out a set of data to make it look real and test a system.
KENNEDY: I see two major uses for this. I agree that Faker is awesome; it's no joke. You basically install Faker and you can go to it and say, ‘Give me some words. Give me a name.’ If you say faker.name, it'll be like, ‘Joshua Wheeler.’ ‘Give me a month. Give me a sentence.’ And it will give you a sentence. ‘Give me a state.’ ‘Michigan.’ ‘Give me a random number.’ You can ask for all these different things.
One of the really good uses for this is if you're doing a web development and you don't have any data yet, it is super hard to even write the code to process the sequences, but also very hard to do the design of like, ‘How is this supposed to look?’ Having real-ish data makes that process so much easier and it's really easy to go, ‘Give me a month. Give me a year. Give me a state’ – things like that – and generate fake data with this.
The other one obviously is with testing, right? Instead of having all the trouble of coming up with these things for the fake pieces of data you're going to pass, and you don't necessarily want to hard code it. Maybe that's going to put some dependency on that hard-coded value in your test. Just run Faker across your objects and fill them up.
OKKEN: It has some things in it that you don't really think about. Like I ran the phone number a few times and it listed phone numbers with extensions, phone numbers with dashes, phone numbers without, and phone numbers with parentheses, stuff that you probably should deal with but might not come up with on your own.
I was looking through and one of the neat things is it has py structures too. It has, under the py section, that you can generate a py dictionary. Basically, you get a dictionary or a tuple or set, and it just comes up with random tuples in random dictionaries. It's pretty cool.
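The phone-number variety mentioned above (extensions, dashes, parentheses) is the kind of input a system may need to normalize. Here is a minimal standard-library sketch; the `normalize_phone` helper and the assumption that extensions are introduced by `x` or `ext` are illustrative choices, not something taken from the article or from Faker.

```python
import re

# Assumption: an extension, if present, is a trailing "x123" or "ext. 123".
EXTENSION = re.compile(r"\s*(?:ext\.?|x)\s*(\d+)\s*$", re.IGNORECASE)

def normalize_phone(raw):
    """Return (digits, extension) for a loosely formatted phone string."""
    ext_match = EXTENSION.search(raw)
    extension = ext_match.group(1) if ext_match else None
    number_part = raw[: ext_match.start()] if ext_match else raw
    digits = re.sub(r"\D", "", number_part)  # drop everything but digits
    return digits, extension

samples = ["(555) 123-4567 x89", "555-123-4567", "555.123.4567 ext. 12", "5551234567"]
for sample in samples:
    print(sample, "->", normalize_phone(sample))
```

Feeding a normalizer like this with generated phone numbers is one way to surface formats nobody thought to handle.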
KENNEDY: Wow, how cool. I didn't even know about the py section. You can also switch it to multilingual so, U.S. English, Japanese, Italian, Russian, and so on.
KENNEDY: Yeah. If you are doing localization, like, ‘What would it be like if I got a Russian name in here? Would my system still work?’ Well, try it. So, that's pretty cool.
OKKEN: I like it.
KENNEDY: Yeah, if you need fake data, check out Faker.
OKKEN: (Laughs) Seems funny to say but, you know.
KENNEDY: Yeah, indeed it does.
So, Brian, I totally skipped over your first one, with Stack Overflow Trends. That's pretty exciting.
OKKEN: Oh, yeah. Let’s go ahead and talk about it. So, Stack Overflow Trends… Stack Overflow came out with a tool called Stack Overflow Trends and the article that they have to introduce it, the first example that they show, is “Python overtaking PHP (Hypertext Preprocessor) for questions asked per programming language.” Of course, they only compare PHP, Perl, and Python. Perl, apparently, nobody asked questions about Perl.
KENNEDY: Perl is not a growing area of study. I think now, the closest analogy to this would be, ‘What does Google Trends do?’ That does that for searches; this is the same type of tool, but for Stack Overflow popularity.
OKKEN: Yeah, I think it’s neat to look at what kind of questions people are asking and how that grows. There is definitely a steep… So, Python was fairly flat from 2008 through 2012 and then a sharp curve up just starts taking off.
KENNEDY: Yeah, it’s really great. It's just like somebody flipped a switch in 2012. Python is growing; it's awesome. Yes, so if you want to study things, definitely this is a place to go do it. Maybe you're looking at like, ‘What should we base our next project on? What are the future trends in programming technologies?’ This is a good tool for that. And it's great to see that highlighting Python’s growth and popularity.
OKKEN: Yeah. So, normally we would have a sponsor spot right now.
KENNEDY: Yeah, but there's like this quiet period, right?
KENNEDY: So, no sponsor this week, you guys. We have upcoming sponsors, they kind of plan stuff out sparsely. But if you are out there and you're like, ‘Hey, my company wants to get the word out to find Python developers,’ send us a message. Just go to the contact page on pythonbytes.fm and we’d love to talk to you.
Alright, I get questions all the time from people who are learning to code. One of the guys on Twitter, Alan Jones, sent us a message about a pretty cool Medium article that really is very data driven about people learning to code. This article is called, “We asked 20,000 People Who They are and How They're Learning to Code.” So, that's a lot of people.
KENNEDY: Now, they said, ‘Alright.’
OKKEN: They probably did Skype or something, because that would be a really big phone bill.
KENNEDY: Yeah, ‘I'm going to mail you a letter.’
So, they said who participated. Well, there were 20,000 people in this survey and most of them have been coding for less than 5 years; 62% live outside the U.S. This is interesting: the average age of people learning to code is 28. So, I get messages all the time like, ‘I'm 30. There's no way I can learn to code.’ You’re with these 20,000 other people, right? It's not that uncommon, it's actually the average age. And if you're over it, right? Definitely an age range
There are many interesting pictures and graphs in this article; it's a data analysis type-thing. They've got average age of learning to code by country, so you look at like France and the U.K., and those guys are in their 30s on average. You look at India and they're in their teens on average, which is, I don't know what that means, but it’s interesting.
Another interesting stat that I thought we could pull out is 19% are women. While obviously that is super low compared to where it should be, that should be 50%, but still 19% I guess is higher than I expected. It kind of made me happy because I feel like it's a positive trend even if it's not where it should be. The average person learning to code has been coding for 21 months and 25% of them already have their first job.
So, there's a bunch of cool stats like this, that you can go and pull out. Check out that article, “We asked 20,000 People Who They are and How They're Learning to Code.”
OKKEN: And 59% wanted to become full-stack web developers.
KENNEDY: Yeah, it’s interesting, right? The web definitely factors heavily, with data science being number two. This is not a Python-only study, these are just people learning to code, but you can imagine Python is playing a heavy role in those two areas. They also have a podcast section, which is kind of cool.
OKKEN: What do you mean?
KENNEDY: They have a section on what podcasts people who are learning to code listen to.
OKKEN: Are we on there?
KENNEDY: Talk Python is. I didn’t find Python Bytes unfortunately, but that's because we are still letting them know.
OKKEN: Yeah. Well, I'm glad that Talk Python is on there. That's pretty cool. Congratulations.
KENNEDY: Thank you.
Would you say that it's an anomaly that Python Bytes wasn't on there?
OKKEN: I think it's just because we’re new. We don't really teach people how to code though.
KENNEDY: No, no, no…
OKKEN: Oh, you were trying to do a transition! Oh, that's so cool.
KENNEDY: I was trying to. (Laughs)
So, our next item is about anomaly detection.
OKKEN: Yeah, anomaly detection. You have to forgive me, it's almost midnight here in Munich.
KENNEDY: That's right, you're still on your German tour.
OKKEN: Yeah, two more days. There was a really great article, and I should have written the person's name down, called “Introduction to Anomaly Detection” (Emmanuelle Rieuf). It's using Python, and the data analysis need it's using it for is anomaly detection. Basically, you're looking at a whole bunch of data from something and finding the ones that don't fit whatever the trend is for everything else, even when you don't really know what the trend is going to be. It's actually just a fascinating couple of pages on here. There are code samples. I'm not doing it justice talking about it, but it’s definitely a well thought-out, well-studied article from datascience.com.
KENNEDY: Yeah, they have like a couple of areas that they focus on. They've got like, the types of categories of anomalies, like the ones you might think of, which they call point anomalies. So, detecting credit card fraud based on amounts spent, like I live in the U.S., somebody tried to buy $1,000 worth of lumber in Mexico with my card. No, that's probably not okay. Real story.
Then they have contextual anomalies. So, they say sometimes these things make sense but only within a context. For example, spending $100 on food every day is totally reasonable on a vacation, but it’s odd if you're not on vacation. Can you determine, ‘Are they on vacation?’ Or collectively, like copying like tons of data off network servers might look like you're trying to steal data, if it knows that you're doing this all over the place. But copying one big file would be nothing, right? They basically break it down by those two categories. It’s pretty interesting and all the machine learning-based approaches and stuff.
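The vacation example can be sketched as a contextual check: the same amount is judged against past values from the same context only. This is a minimal illustration, not the article's code; the spending figures, the `history` layout, and the three-standard-deviation threshold are all assumptions made for the example.

```python
from statistics import mean, stdev

# Hypothetical daily food spending, grouped by context.
history = {
    "home": [18.0, 22.0, 25.0, 20.0, 15.0, 24.0],
    "vacation": [95.0, 110.0, 90.0, 105.0, 100.0],
}

def is_contextual_anomaly(amount, context, history, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard deviations
    from the mean of past amounts seen in the same context."""
    past = history[context]
    center = mean(past)
    spread = stdev(past)
    return abs(amount - center) > threshold * spread

on_vacation = is_contextual_anomaly(100.0, "vacation", history)
at_home = is_contextual_anomaly(100.0, "home", history)
print(on_vacation, at_home)  # False True
```

The hard part in practice is the question raised above, working out which context a data point belongs to; this sketch simply assumes the label is known.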
OKKEN: Yeah, and the math behind it, like the moving averages and the k-Nearest Neighbor and k-means algorithms. Things like that.
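Of the techniques named here, the moving average is the simplest to sketch. The snippet below flags a point when it sits far from the mean of the values just before it; the window size, the threshold, and the sample series are assumptions for illustration, and the article's own code may differ. k-Nearest Neighbor and k-means are not shown.

```python
from statistics import mean, stdev

def moving_average_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(values)):
        recent = values[i - window:i]
        center = mean(recent)
        spread = stdev(recent)
        # Skip perfectly flat windows, where any change would look infinite.
        if spread > 0 and abs(values[i] - center) > threshold * spread:
            flagged.append(i)
    return flagged

series = [10, 11, 9, 10, 12, 11, 10, 50, 11, 10, 9, 12]
anomalies = moving_average_anomalies(series)
print(anomalies)  # [7]
```

One side effect worth noticing: right after the spike, the window's spread is inflated by the spike itself, so a second anomaly soon afterward would be harder to catch.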
KENNEDY: Yeah, absolutely, very cool.
OKKEN: I think I'm going to use the k-Nearest Neighbor just in random conversation tomorrow, just to make me sound smarter.
KENNEDY: (Laughing) ‘Where should we go to eat? I don't know, but we're going to have to apply the k-Nearest Neighbor to these restaurant choices and figure it out.’
OKKEN: Yes, definitely.
KENNEDY: So, I want to close this out with a message from the BeeWare guys. So, BeeWare’s a cool project that really does a bunch of fairly unique things. It supports running Python apps on things like iOS and Android, macOS apps that are native .app files in Python, alternate Python implementations, some cross-platform widgets, and a couple of other things. It’s done by Russell Keith-Magee and it's been going on for about four years, so really great. He posted a thing that said, “A Request for Your Help.” Basically, he's been working for a company that's largely funded the development or the furthering of these projects, so they've got extensive improvements for this cross code compiler, an Android backend, a Django backend for these Toga apps that can be run as web apps or local, and a Windows Forms .NET UI for Toga. You can have like a Windows app that has a modern, natural appearance on Windows, all sorts of cool stuff. So, a cool project. Obviously, with the request for help, you know what is up, right? Well, his contract ended so now he's like, ‘I don't have all this time and energy I could put in here. I got to go back to work.’
The reason I bring it up is, we've got a lot of projects that are looking at different funding models to allow people to work on them. There's the pretty standard ‘create a project and then try to sell consulting on top of the project’. There are more interesting, ‘platform as a service’ type things that people are doing. The Redash guys, that I talked about last week on Talk Python, have hosted versions of their open source thing. We've got the Scrapinghub guys with Scrapy doing their infrastructure as a service or platform. All these are very interesting. Basically, Russell says, ‘Hey, could you sponsor my project?’ And you know, check out his page. You can become a member and give him $10 and help keep this moving, because they do a bunch of cool stuff. But also, if you have a project and it needs funding, think about what he's up to, does it make sense for your project, things like that.
OKKEN: I think it’s a neat idea.
KENNEDY: I do, too. So, I definitely think the BeeWare project has huge possibilities for where it could help people. And certainly, if you just want to work on an open source project, people ask me all the time, ‘Hey, can you recommend a project that I could work on? I just want to get started on something, and I don't really know enough to pick something myself.’ I think BeeWare is a really good one. They have a very welcoming, explicit way of on-boarding people who are new to open source. So, that's also a way to help them out.
OKKEN: Definitely. Well, good luck to all them.
KENNEDY: Yeah, good luck. That would be cool to see that keep growing because it's doing cool stuff over there in that project.
Alright, well I have one shout out for us out of my own personal news, Brian. There's a brand new PyCon, not the major, main PyCon, but a regional PyCon, and it's in a pretty sweet place, so I'm starting to think I might have to attend this. If you check out pycascades.com, it's in Vancouver, B.C. in January, the next January. So, if you want to go up to the Pacific Northwest, one of the more beautiful cities around here, they have things like a PyCon hike, as well as all the talks and stuff. You can check it out at pycascades.com. That sounds like a lot of fun.
OKKEN: Yeah. Actually, you know, being in the northwest you’d think I’ve been to Vancouver; I haven't ever been there.
KENNEDY: Here's your chance.
OKKEN: I might have to go up there.
KENNEDY: Yeah, we might have to just jump on the train and go up there. Sounds good.
OKKEN: Well, the book is very close, so I've been working late evenings here, getting it ready, working with my editor. Right now, the beta is supposed to be available right before PyCon or technically, at the beginning, on the 17th. So, next Wednesday.
KENNEDY: So, come check us out at our booth. Meet Brian, talk to him about his book. It should be ready by then, you'll probably be tired.
OKKEN: Yeah, and my family's going to be a little irritated with me.
KENNEDY: ‘I haven’t slept for three days!’
Alright. Well, Brian, thanks for chatting with me and sharing this news with everyone.
OKKEN: Thank you.
KENNEDY: You bet. Bye.
Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes and get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We’re always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.