Brought to you by Michael and Brian - take a Talk Python course or get Brian's pytest book


Transcript #20: Finding similar but not identical images in 128 bits via Python

Return to episode page view on github
Recorded on Tuesday, Apr 4, 2017.

00:00 Hello and welcome to Python Bytes.

00:03 This is episode 20 where we are delivering Python news and headlines directly to your earbuds.

00:09 I'm Michael Kennedy.

00:10 I'm Brian Okken.

00:11 We've got a bunch of stuff lined up for you today.

00:14 I'm really excited to share especially this first article, which is so clever that you chose Brian.

00:18 Before we do, I want to say thank you.

00:20 Thank you to Rollbar who's back to sponsor a bunch more Python bytes.

00:25 We'll talk more about Rollbar later, but thanks Rollbar.

00:28 That's awesome.

00:28 - Yep, so we were just talking about pictures, like I have many gigabytes of pictures, and if you ran a website that accepted uploads in large numbers of pictures, how do you deal with all that data, especially there's probably a lot of duplicate data, right?

00:44 - I'm not sure, and so this is an interesting article.

00:47 There's an article from JetSetter.com, and they're an invitation-only travel community, but the article is duplicate image detection with perceptual hashing in Python.

00:59 And that actually sounds more--

01:01 - Perceptual hashing, that's awesome.

01:03 - Perceptual hashing, it's awesome.

01:06 And the idea is they've got, I mean, the site's got a bunch of pictures of different places around the world, and they don't want pictures that are mostly close to each other.

01:17 I mean, for family photos, you got a ton that are close to each other, but I get for like, there's a lot of cases where you don't want things that are almost the same.

01:24 Right, like pictures of hotels or pictures of a marina to say here's the view out of the hotel.

01:31 Like if they're going to have a listing on like some location, some hotel, and they ask people to upload them, they don't need like a hundred ones from this one view.

01:38 And if you check out JetSetter.com, it is an intensely photo-heavy site.

01:43 Like I'm pretty impressed with the number of photos on that page.

01:46 With the idea of perceptual hashing, I was definitely interested in reading about this, And I expected it to be a fairly complicated algorithm, but it's actually ingenious.

01:56 And they use Python and transfer the image down to just a 9 by 9 square of gray values, even.

02:06 I don't get how that's enough information, but it is apparently enough to determine whether or not an image is close to another image.

02:14 And they do a delta.

02:17 I'm not going to be able to-- can you explain that much better?

02:19 I can try.

02:20 I mean, when I read, we take a five megapixel image and we generate 128 bit hash, and that means a thing.

02:30 Like that means uniqueness, or actually it means similarity, which is actually more important.

02:34 I was like, okay, I have to figure this out.

02:36 And I guess what they do is they take a large image and they like average it down to a nine by nine, or they say for larger images, like a 17 by 17 image.

02:46 And to determine the similarity, maybe somebody's off by five feet to one side or the other to take a picture of a hotel or a view or something, but if you like kind of average it down to that nine by nine, that's where the similarities kind of collapse into those grids, and then they run an algorithm on that grayscale grid, right?

03:05 - Yeah, and then the interesting thing is that of course it's clear to me that you could come up with a hash algorithm for an image, but the difference in the hashes is enough to tell you how close the image is.

03:19 - Yeah, and it's actually the opposite that really blows me away is like, two similar images that are not the same generate the same hash.

03:27 That's what's the magic.

03:28 Like, that totally blows my mind.

03:29 I could see like, well, obviously, hashes are different, images are different, but images are similar, not the same, hash the same, that blows me away.

03:37 - Yeah, and I like it that it's not that complicated of an algorithm, and it's a fun read.

03:42 - Yeah, that's, you know, so I think, there's a couple levels of interesting that you brought up this article.

03:47 And one of them I think is really interesting is when I first heard that, I thought, okay, one, this is gonna be super hard, super computational.

03:54 Two, maybe this is like machine learning or something like that, like two machines, like two images given to an AI, like a deep learning neural network or something.

04:02 So yeah, these are sufficiently similar in ways that I don't really, people don't really understand, but magic on GPUs and lots of neurons, it works out somehow.

04:12 But the fact that it's really, really a simple algorithm is what's I think kind of special about it, right?

04:18 It's like, hey, there's still lots of places to be clever and not just throw AI plus GPUs at a thing.

04:24 - Yes, definitely, yeah.

04:26 - And not only that, you get to take it with you, right?

04:29 It's available on GitHub.

04:30 - Yeah, they do have it.

04:31 It's a, what is it, pybktree?

04:34 - Pybktree, whatever that means.

04:37 Okay, awesome.

04:38 I'm sure it's part of the algorithm.

04:40 Excellent, so keeping with open source projects that you can go find and just grab and do cool things with.

04:46 One of the listeners pointed me towards, pointed us towards Google Open Source.

04:51 In fact, it was the guy from Google Fire, Python Fire, which we'll talk more about later.

04:56 But he has one of the projects there.

04:59 And on Google Open Source, they've basically created like a listing directory of all of the open source projects.

05:05 Now, many of the projects still live on GitHub.

05:08 But this is like a place where you can go search and analyze and discover projects from Google. And what's cool is you can sort by language. So show me the Python projects, show me the C++ projects, whatever. So I grabbed six or seven interesting projects. I just wanted to run them down for you, Brian. Okay. Yeah, so one of them is Subprocess32, a reliable subprocess module for Python 2.

05:29 Apparently subprocess that built in is not reliable for Python 2. I don't know.

05:35 But I didn't know that either. That's partly why it's interesting to me.

05:39 but also, you know, there it is, that's cool.

05:41 Grumpy, we've talked about Grumpy before.

05:43 Grumpy is Python on Go instead of Python on CPython.

05:46 - Yeah. - Yeah, that's a good one.

05:47 - There's Python Fire, of course.

05:49 - Python Fire, of course, like I pointed out.

05:51 That's a way to take any Python object or module and turn it into a command line interface.

05:56 There's a Python client for Google Maps services.

05:59 So if you wanna consume Google Maps from Python, do it.

06:03 There's Hue, H-Y-O-U, a Python interface for manipulating Google Spreadsheets.

06:10 That's cool, right?

06:11 - Okay, I'm gonna have to try that out, that's me.

06:13 - Yeah, I mean, I've seen the stuff for working with docxlsx files, the Microsoft Office ones, but I didn't know about the Google Spreadsheets, so this is cool.

06:22 Another thing that's always tricky for me is working with OAuth, right?

06:26 There's always this, like, I've got some app, the app needs to go, like, open a browser window, and there's some sort of funky callback, and things happen.

06:33 And so one of the places that's especially challenging, I think is over a command line interface.

06:39 Well, there's OAuth 2 L, I think it's L.

06:43 And what that is, is it's a way, a command line tool to get an OAuth token.

06:48 Just let that sink in for you.

06:50 - Okay.

06:51 - So I wanna log in as Google, I can do that like through my app.

06:53 Like I could basically create a shell script that through the CLI gets an OAuth token from the user.

06:59 That's pretty interesting.

07:01 Okay, and also I talked about the Google Maps API.

07:04 Like that sounds like that's something that's really hard to like unit test or test at all without actually going to Google.

07:10 So there's a mock maps API.

07:12 So a small little app engine app for testing, like basically mocking out Google maps API.

07:18 And last but not least TensorFlow, the amazing deep learning machine learning stuff.

07:24 That's about 50% Python, 50% C++ and a lot of GPUs in action there.

07:31 And I don't know where I read this, but I think that this Google open source location is not just all projects.

07:39 It's projects that they consider still active.

07:41 Okay. Yeah, that's cool.

07:43 I mean, obviously you don't want just like a dumping ground, right?

07:45 Yeah, cool.

07:47 I mean, everything in here looked pretty neat and fresh.

07:49 So it's good.

07:49 It's a fairly neat interface too, with, I guess, with panels and stuff.

07:54 Yeah, it's worth checking out.

07:55 Okay, what do we got next?

07:56 Oh, next is me.

07:57 Yeah, more machine learning type stuff.

08:00 Yeah, so there's a article from Jason Brownlee called, and I just clicked away, "How to Handle Missing Data with Python." And this is something that I definitely deal with, measurement values that deal with at work, but there's, the gist of it is, is a lot of times you're dealing with a lot, large or small data sets, and some of the values are missing.

08:22 And there's a whole bunch of different ways you can deal with missing data, but there are a few of them that he talks about are replacing, you have to know what the magic number is that some data collection will fill in a zero, maybe if there's no data, or some other known number, but all your math is gonna get messed up if you actually just leave that there.

08:44 So there's a couple ways to get around it.

08:46 One of the ways he lists is using magic not a number values, and I think pandas can deal with that correctly and not average those in.

08:55 - Yeah, what I think is really nice about it is like I could be given a CSV file or some sort of data thing, set of data, and I could like work my way through it and maybe find the bad data and fill it in potentially, but his fix are like you run this one line in Pandas and magic happens and it's better, right?

09:14 It's like the fix is so much better than the fixes that I would come up with.

09:17 - Yeah, and I do like that he's talking about different ways to deal with it with NumPy even without Pandas also because he might not be using Pandas, But one of the ways you would do it with any math package really would be to, oh, I guess I don't know how to do that.

09:33 Actually, never mind.

09:35 Filling in the, you'd somehow have to find all of the values anyway and fill them in with, like one of the ways is if you're calculating an average, calculate the average of everything else and then fill in the blanks with the average number.

09:50 - Right, I guess it depends on what you're gonna do.

09:51 Are you gonna average it or are you gonna max it, are you gonna min it?

09:54 You could like push that through, right?

09:55 - Yeah. - Yeah, interesting.

09:56 - The best solution definitely I think is using the not a number and letting the libraries take care of it for you.

10:03 But I wanted to bring this up partly because anybody that's working with data collection or doing math with that has to deal with the fact that sometimes there's not numbers there and you have to deal with it.

10:16 - Okay, awesome.

10:18 He's from machinelearningmastery.com I think.

10:20 And he's got just a ton of cool stuff going on over there, right?

10:25 It's not just this one article, right?

10:27 So if you're into these kinds of things, definitely check it out.

10:29 Yeah, it looks good.

10:31 Okay, so what's up next is the HugRest framework.

10:37 But before we get to them, I want to give Rollbar a hug.

10:40 Rollbar is awesome.

10:41 I've been, as people know, I've been using them for a long time on the websites and the websites are getting more and more traffic.

10:48 And I recently, I'm not sure whether it was a wise decision or not, because I'm really busy with other stuff, but I just got really frustrated with the way my servers are working, the way I could sort of move them around and performance and stuff.

11:00 So I said, that's it.

11:01 One day I just woke up, said, that's it, converting it all to MongoDB.

11:05 And so last, that was last week.

11:07 And that took like three days of rewriting all my sites to Mongo, which I really think Mongo is the right choice.

11:13 And I'm just loving the way it's working now.

11:16 But that was a pretty serious, like take the guts out of all my web apps and stick in a new set of guts that are similar but not entirely compatible.

11:26 I spent a little time with roll bar, and they they they helped me out and find a few problems like where maybe types used to be strings, I compare them where one was no longer a string and they didn't compare the same.

11:37 So I got weird errors, but roll bar made it super easy to track that down.

11:40 So if you want to have reliability, and most importantly, awareness of the state of your apps, plug in Rollbar to your web apps. You can use it in Pyramid, Flask, Django, whatever. Just plug it in and you'll get notifications right away. So be sure to visit rollbar.com/pythonbytes and you'll get a special offer to get started there. And I bet that you definitely noticed those messages but I didn't even notice you were mucking with things and I'm pretty sure that nobody else did or very few people did either. Yeah that's true and thank you for saying that but I actually know how many people ran into problems right? There was a couple, but I got an email from a couple people saying, hey, I had this problem with your app.

12:23 I'm like, I know, but I didn't know your email address.

12:26 But I know what your problem was, and it's already fixed.

12:28 I just couldn't contact them, because they hadn't actually created an account yet.

12:32 So it was really nice to be able to just say, yeah, actually, the problem you're telling me is already fixed.

12:37 I just couldn't communicate that back to you.

12:38 Really sorry about that.

12:40 It's awesome.

12:40 You seem like a big team then because of that.

12:43 Oh, yeah, definitely.

12:45 All the folks here in the cubicle farm, we're busy.

12:47 (both laughing)

12:49 You know, one of the next things that I wanna do is build some nice APIs.

12:54 And I think it's really an interesting time for the web in Python.

12:58 There's a lot of flowers blooming, if you will.

13:01 Right, we've got Pyramid, Django, Flask, those guys are all doing super stuff, and like most of my stuff's Pyramid.

13:08 But we've got to Pronto coming along, Sanic, and another one that I just learned about is called Hug.

13:14 at hug.rest.

13:15 How's that for a name and a domain?

13:18 - Yeah, actually it is.

13:19 It's www.hug.rest.

13:21 - Hug.rest, that's beautiful.

13:22 So Hug is a Python web framework just specifically for building restful, documented, documentable, versionable APIs.

13:33 And it's built both for like super simplicity and flexibility as well as performance.

13:39 So I started looking this up.

13:41 Wow, this is quite interesting.

13:43 Okay, so the idea is you can create an API once and you can consume it in all these different ways.

13:49 So you can import it as a module or a package into your project and use the API that way.

13:55 You can communicate it obviously over HTTP is like a restful API, or it also has a CLI command line interface way to expose that.

14:03 So if you write like some kind of web app, or functionality, you want to expose over an API, but you also want to call it locally.

14:10 It's like the same code.

14:12 It's interesting.

14:13 It's also written in Python 3.

14:15 It's uses Cython all over the place.

14:18 So it's like super fast.

14:20 It's one of the fastest web frameworks out there for these kinds of things.

14:24 At least the non async version, let's say.

14:27 If you compare those, it's pretty cool.

14:29 It's got a decorator model.

14:30 So the code looks really clean.

14:31 Yeah, and the decorator model is cool because the decorator model will do like version management, you can have like version one and version two of the API that have like different data formats and they can just coexist. You get automatic documentation based on that like it'll do type annotations and then like use the type annotations as part of the documentation and things like that. It's a pretty cool simple little framework so you know hug for those guys nice job.

14:55 Definitely. Speaking of CLIs. Yeah speaking of CLIs I was I'm actually working on I had an example I wanted to do that I'm running with the pytest book that I'm working on. And for the front end of it, I was punting before and not using actually putting a front end on the application, but I wanted to at least put a command line interface in. And my first attempt was to go down ArgParse and the particular quirks of this application, I needed subcommands. Actually, just the tutorials I found were out of date, didn't work, and I was having a little bit of difficulty. So I went ahead and tried Click. I'd heard of Click before and hadn't tried it. And man, a tutorial from like three years ago was about what I needed and it works right off the, right away. I've got like half a page of code and my interface, my command line interface is done.

15:50 That's really cool. It's also decorator heavy, right?

15:53 Yeah. In my Sublime Editor, it's colored nicely and my wife walked by and said, "That's such beautiful code." >> Oh, lovely.

16:00 >> Lovely.

16:01 >> Let's take that on many, many levels, right?

16:02 That's awesome. Yeah, that's by Armin Roenker, the guy from Flask.

16:05 So definitely.

16:07 >> Oh, did he do click?

16:08 >> I think so, yeah. I believe so.

16:09 Yeah, nice. Click is cool.

16:11 I've done a little bit of work with it, and I've liked what I've seen.

16:14 >> But I also want to, yeah, we'll talk about it later, but I might want to try adding a different CLI interface to it as well.

16:21 >> Yeah, cool. So the last one that I chose for us is a refresher, back to the fundamentals type things.

16:27 So Python inheritance class and our instance class and static methods demystified.

16:35 So this one is on real python.com.

16:37 And I went over there and checked it out.

16:39 And I said, Okay, real python.com.

16:40 That's cool.

16:41 And then I realized this is actually from Dan Bader.

16:43 And we seem to be covering a lot of Dan's stuff over here.

16:45 And actually have more to say about Dan later still.

16:48 So this was a guest post Dan did for that, although I didn't realize that until I started getting into it.

16:53 And idea was to like demystify what's behind class methods, static methods and regular instance methods.

17:00 If you learn Python classes, if you learn classes and inheritance and object oriented programming only through Python, this will be like, obvious to you.

17:10 But if you come from other languages like C++ or Java or C# or JavaScript, there's differences to the way Python classes and inheritance works.

17:20 And it's worth kind of a compare and contrast.

17:22 So he comes up with a class, and it's got like a regular method, a class method, so an at class method decorator, and takes a CLS parameter and a static method with an at static method decorator and nothing, and basically compares and contrasts how they work.

17:37 And so some of the things that I think are not obvious when you're first getting started is like instance classes, those are pretty straightforward.

17:44 Like you call them on instances like all other languages, but the fact that I can call static methods or class methods on instances.

17:54 That's a little bit funky.

17:55 Right?

17:56 Yeah, that seems a little weird.

17:57 And then the other one, the main one, I think is like, what's the different?

18:00 Why are there two things like static method and class method?

18:02 They seem the same.

18:03 Why are there two and then like, when would I use one versus the other?

18:07 Right, the class method takes a CLS method, which is literally the type that is on.

18:13 And the static method just doesn't.

18:14 But other than that, they seem the same, right.

18:16 So if you're going to say, like interact with the class, like during the class method, if you're going to create an instance of the class, you can use the CLS parameter to support like inheritance and stuff.

18:28 So I got like, let's say a vehicle class and a car like a Tesla car class, that static method could say like allocate a CLS, whatever that is.

18:39 And if you call it on a Tesla static ish function class method, it would actually create a Tesla I would change like the thing, the type that it knows it is, where the static method is just like a grouping.

18:49 So I thought that was interesting.

18:51 - Does the class method follow then the hierarchy then?

18:54 So if I declare a class method on a base class, is it available to the subclass?

19:01 - Yes, always.

19:02 And that's always true for static methods, but the difference is the static method doesn't really know what type it's being called on.

19:08 - Oh, okay.

19:09 - Whereas the class method, it's given the type.

19:12 So if there's like, you call it farther down and the inheritance chain, that whatever level you're at, that instant or that type actually is communicated to it.

19:20 And so you're kind of, you're told where you are in the hierarchy in a class method, where in static it's just like, it's just a method.

19:26 Go for it.

19:27 - Okay. - Yeah.

19:28 - I don't think I've ever used static methods for anything.

19:30 - Yeah, well, they're out there hanging out with their friend in class methods.

19:33 (laughing)

19:34 - Interesting. - Indeed.

19:35 So I have a quick follow-up from the last show.

19:37 David Beaver from Google, and he, the guy who works on Python Fire, sent us a note.

19:42 And you said something to the effect of, look, Python fire is awesome.

19:46 But I Python is a serious dependency to take.

19:49 If I just want to CLI, right.

19:51 And I think that's fair.

19:52 That's fair.

19:54 But he said, Hey, you know what, one of our primary plans is to remove IPython as a dependency, we're just not there yet.

20:00 So if anybody in the audience wants to help those guys move forward, they're totally working on that.

20:06 So Google Fire or Python Fire from Google is definitely getting some interesting thinning out and will be very nice.

20:14 And actually, I like to hear that, that they're working on eventually getting rid of that dependency.

20:20 And it's pretty cool.

20:22 Also that it's something I had mentioned when we talked about Python Fire that your development time is important too.

20:29 And putting an interface together with that is pretty fast.

20:32 So keep that in mind.

20:33 It's not always about optimizing for the machines.

20:36 Definitely.

20:37 Hey, one more follow up is we did cover PDR2 or PDR a couple episodes ago with the DR colors prints out.

20:49 One of the complaints I had was that it didn't look that great on my black terminal.

20:54 I had the same problem.

20:55 I like darker stuff and I'm like, wait, where's all the words?

20:58 They just updated it and I guess yesterday I think.

21:03 And it does have color configuration now.

21:05 So you can drop a Peter2 config file in your home directory and I set my background color to magenta so that it was visible for docs, for visible on both black and white.

21:18 Now it looks great.

21:19 - Oh nice.

21:20 Peter2 now has themes, love it.

21:22 All right, how's the book coming?

21:26 I heard there's a spotting.

21:28 - Yeah, so on Twitter the other day, somebody, a guy named Jacob Jaros, I think that's right, noticed that it was listed on the Pragmatic Publishers website, so it's out there.

21:40 - That's awesome, I love the cover.

21:41 The rocket is cool.

21:43 - Yeah, a '50s sci-fi nerd, so.

21:45 - Yeah, and it's perfect, like, it's '50s, '60s vintage rocket.

21:51 - So how about you?

21:52 - Well, it has been a super busy couple of weeks.

21:55 I've been working on a couple of classes.

21:58 one of them I'm about to release.

22:00 By the time this recording comes out, it will be out so tomorrow basically, a course called using and mastering cookie cutter.

22:08 So really deep dive into what is cookie cutter?

22:11 How do you create and manage projects with cookie cutter?

22:13 I think it's going to be a really fun course.

22:16 And also just a few hours ago launched managing Python dependencies with pip and virtual environments, which Dan Bader, speaking of Dan Bader, came over to join me to write a class for us over here and we're shipping that as well.

22:30 So in that I took that course and I actually learned quite a bit from it.

22:34 It's not just like pip install done.

22:36 It's how do you what is the process that you use to manage your dependencies?

22:41 How do you like what is the thinking and workflow used to evaluate whether a package is worth taking a dependency on and all sorts of cool stuff like that bunch of best practices, launch both of those and I just started selling course bundles on Talk Python Training as well to sort of go along with us.

22:57 So lots of stuff.

22:58 >> That's pretty exciting.

22:59 I got to check out the cookie cutter thing.

23:01 >> Yeah, thanks, Ed.

23:02 It'll be out tomorrow morning.

23:03 For everyone listening, that's today.

23:05 But for you, Brian, that's tomorrow morning.

23:08 The magic of time travel.

23:11 Thanks so much for finding all these great items.

23:13 That was fun as always, Brian.

23:15 >> It was fun for me too.

23:17 And thanks to everybody for all your feedback that you send in.

23:19 >> Yep.

23:20 one, and thank you, Roel Bar, for supporting the show.

Back to show page