Transcript for Episode #20:
Finding similar but not identical images in 128 bits via Python
00:00 Python Bytes Transcript
00:00 #20: Finding Similar but Not Identical Images in 128-Bits Via Python
00:00 KENNEDY: Hello and welcome to Python Bytes. This is Episode #20, where we are delivering Python news and headlines directly to your earbuds. I’m Michael Kennedy.
00:00 OKKEN: And I’m Brian Okken.
00:00 We’ve got a bunch of stuff lined up for you today. I’m really excited to share – especially this first article which is so clever – that you chose, Brian.
00:00 Before we do, I want to say, ‘Thank you.’ Thank you to Rollbar, who’s back to sponsor a bunch more Python Bytes. We’ll talk more about Rollbar later, but thanks, Rollbar.
00:00 That’s awesome.
00:00 We were just talking about pictures. I have many gigabytes of pictures. If you ran a website that accepted uploads of pictures in large numbers, how would you deal with all that data? And it’s probably a lot of duplicate data, right?
00:00 I’m not sure. This is an interesting article. It’s an article from Jetsetter.com and they’re an invitation-only travel community. The article is, “Duplicate Image Detection with Perceptual Hashing in Python.”
00:00 ‘Perceptual Hashing.’ That’s awesome.
00:00 ‘Perceptual Hashing.’ It’s awesome. The idea is, the site’s got a bunch of different pictures of places around the world and they don’t want pictures that are mostly close to each other. For family photos, you’ve got a ton that are close to each other, but I get that there’s a lot of cases where you don’t want things that are almost the same.
00:00 Right. Like pictures of hotels or pictures of a marina to say, ‘Here’s the view out of the hotel.’ If they’re going to have a listing on some location or some hotel and they ask people to upload them, they don’t need a hundred ones from this one view. And if you check out Jetsetter.com, it is an intensely photo-heavy site. I was pretty impressed with the number of photos on that page.
00:00 The idea of perceptual hashing, I was definitely interested in reading about this. I expected it to be a fairly complicated algorithm, but it’s actually ingenious. They use Python and shrink the image down to just a 9x9 square of gray values, even. I don’t get how that’s enough information, but it is apparently enough to determine whether or not an image is close to another image. They do a delta. Can you explain that better?
00:00 I can try. When I read, ‘We take a 5-megapixel image and we generate a 128-bit hash,’ and that hash captures a thing’s uniqueness, or actually its similarity, which is more important, I was like, ‘Okay, I have to figure this out.’
00:00 I guess what they do is they take a large image and average it down to a 9x9 image, or for larger images, they say, something like a 17x17 image. To determine the similarity – maybe somebody’s off by 5 feet to one side or the other when taking a picture of a hotel or view or something – if you average it down to that 9x9, that’s where the similarities collapse into those grids. And you can run an algorithm on that grayscale grid, right?
00:00 Yeah, and the interesting thing is that, of course, it’s clear to me that you can come up with a hash algorithm for an image, but the difference in the hashes is enough to tell you how close the image is.
00:00 Yeah, and it’s actually the opposite that really blows me away. It’s like, 2 similar images that are not the same generate the same hash. That’s the magic. That totally blows my mind. I could see it if the hashes are different when the images are different. But the images are similar, not the same, and the hash is the same. That blows me away.
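The shrink-to-a-grid-and-compare-neighbors idea discussed above can be sketched in plain Python. This is a minimal illustration, not the article's code: it assumes the image is already a 2-D list of gray values, and the names `shrink`, `difference_hash`, and `hamming` are made up for the sketch. A 9x9 grid gives 64 left-to-right comparisons plus 64 top-to-bottom comparisons, which is where 128 bits would come from.

```python
def shrink(gray, size=9):
    """Average a 2-D list of gray values down to a size x size grid."""
    height, width = len(gray), len(gray[0])
    grid = []
    for gy in range(size):
        y0, y1 = gy * height // size, (gy + 1) * height // size
        row = []
        for gx in range(size):
            x0, x1 = gx * width // size, (gx + 1) * width // size
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        grid.append(row)
    return grid


def difference_hash(gray, size=8):
    """Return a 2 * size * size bit integer built from neighbor comparisons."""
    grid = shrink(gray, size + 1)
    bits = 0
    # One bit per cell: is this cell darker than its right-hand neighbor?
    for y in range(size):
        for x in range(size):
            bits = (bits << 1) | (grid[y][x] < grid[y][x + 1])
    # One bit per cell: is this cell darker than the neighbor below it?
    for y in range(size):
        for x in range(size):
            bits = (bits << 1) | (grid[y][x] < grid[y + 1][x])
    return bits


def hamming(a, b):
    """Count the bits that differ between two hashes."""
    return bin(a ^ b).count("1")


# Synthetic 90x90 "photos": a gradient, a brighter copy, and a mirrored copy.
image = [[x + y for x in range(90)] for y in range(90)]
brighter = [[value + 10 for value in row] for row in image]
flipped = [row[::-1] for row in image]

print(hamming(difference_hash(image), difference_hash(brighter)))
print(hamming(difference_hash(image), difference_hash(flipped)))
```

In this toy example the brighter copy produces the identical hash (distance 0), because only the relative brightness of neighbors is recorded, while the mirrored copy differs in 64 of the 128 bits.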
00:00 Yeah, and I like that it’s not that complicated of an algorithm and it’s a fun read.
00:00 Yeah. There’s a couple levels of interesting in this article that you brought up. One that I think is really interesting is when I had first heard that, I thought, ‘One, this is going to be super hard, super computational. Two, maybe this is like machine learning or something like that.’ Like 2 images given to an AI, like a deep learning neural network or something, and it decides these are sufficiently similar in ways that people don’t really understand, but with magic on GPUs and lots of neurons, it works out somehow.
00:00 The fact that it’s really a simple algorithm is what I think is special about it. It’s like, ‘Hey, there’s still lots of places to be clever and not just throw AI plus GPUs at a thing.’
00:00 Yes, definitely.
00:00 And not only that, you get to take it with you. It’s available on GitHub.
00:00 Yeah, they do have it. What is it, pybktree?
00:00 Pybktree, whatever that means. Okay, awesome. I’m sure it’s part of the algorithm. Excellent.
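For context on the name: pybktree is a Python implementation of a BK-tree, a structure for finding items within a given distance of a query. The sketch below is a from-scratch illustration of that data structure applied to integer hashes with a Hamming distance. The class and method names are chosen for this example and should not be read as the library's API.

```python
def hamming(a, b):
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


class BKTree:
    """Minimal BK-tree: each node is (item, {distance_to_item: child_node})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            parent, children = node
            d = self.distance(item, parent)
            if d not in children:
                children[d] = (item, {})
                return
            node = children[d]

    def find(self, item, radius):
        """Return sorted (distance, item) pairs within radius of item."""
        found = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(item, candidate)
            if d <= radius:
                found.append((d, candidate))
            # Triangle inequality: only subtrees whose edge distance lies in
            # [d - radius, d + radius] can contain matches.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(found)


tree = BKTree(hamming)
for image_hash in (0b0000, 0b0001, 0b0011, 0b1111):
    tree.add(image_hash)

print(tree.find(0b0000, 1))
print(tree.find(0b0111, 1))
```

The point of the tree is that a lookup can skip whole subtrees instead of comparing the query hash against every stored hash.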
00:00 In keeping with open source projects that you can find and just grab and do cool things with, one of the listeners pointed us towards Google Open Source. In fact, it was the guy from Google Fire, Python Fire, which we’ll talk more about later. He has one of the projects there.
00:00 Google Open Source, they’ve basically created a listing directory of all the open source projects. Many of the projects still live on GitHub, but this is a place where you can go search and analyze and discover projects from Google, and what’s cool is you can sort by language. ‘Show me the Python projects,’ ‘Show me the C++ projects,’ whatever.
00:00 I grabbed 6 or 7 interesting projects. I just wanted to run them down for you, Brian. One of them is “subprocess32: A reliable subprocess module for Python 2.” Apparently, subprocess, the built-in, is not reliable for Python 2. I didn’t know that either. That’s partly why it’s interesting to me, but also, there it is. That’s cool.
00:00 We’ve talked about Grumpy before. Grumpy is Python on Go, instead of Python on CPython. That’s a good one.
00:00 Python Fire, of course.
00:00 Python Fire, of course, like I pointed out. That’s a way to take any Python object or module and turn it into a command line interface.
00:00 There’s a Python Client for Google Maps Services. If you want to consume Google Maps from Python, do it.
00:00 Hyou, a Python interface for manipulating Google Spreadsheets. That’s cool, right?
00:00 Okay, I’m going to have to try that out. That’s neat.
00:00 Yeah, I’ve seen the stuff for working with DOCX and XLSX files, the Microsoft Office ones. I didn’t know about the Google Spreadsheet, so this is cool.
00:00 One thing that’s always tricky for me is working with OAuth. There’s always this like, ‘I’ve got some app, the app needs to go open a browser window and there’s some sort of funky callback,’ and things happen. One of the places that’s especially challenging, I think, is over a command line interface. Well, there’s OAuth2L. It’s a command line tool to get an OAuth token. Just let that sink in for you.
00:00 If I want to log in with Google, I can do that through my app. I could, basically, create a shell script that, through the CLI, gets an OAuth token from the user. That’s pretty interesting.
00:00 Also, I talked about the Google Maps API. That sounds like something that’s really hard to unit test, or test at all, without actually going to Google. There’s a mock_maps_api, so a small, little App Engine app for testing, basically mocking out the Google Maps API.
00:00 And last but not least, TensorFlow, the amazing machine learning stuff. That’s about 50% Python, 50% C++ and a lot of GPUs in the action there.
00:00 I don’t know where I read this but I think that this Google Open Source location is not just all projects, it’s projects they consider still active.
00:00 Okay. Yeah, that’s cool. Obviously, you don’t just want a dumping ground, right? Everything on there looked pretty neat and fresh.
00:00 It’s a fairly neat interface, too, with wood panels and stuff.
00:00 Yeah, it’s worth checking out.
00:00 What do we have next? Oh, next is me.
00:00 More machine learning-type stuff.
00:00 So, there’s an article from Jason Brownlee called, “How to Handle Missing Data with Python”. This is something that I definitely deal with, measurement values, that I deal with at work. The gist of it is a lot of times you’re dealing with a lot of large or small data sets and some of the values are missing. There’s a whole bunch of different ways you can deal with missing data but the few of them that he talks about are replacing, you know, you have to know what the magic number is. Some data collection will fill in a zero maybe, if there’s no data or some other known number but all your math is going to get messed up if you actually just leave that there.
00:00 There’s a couple ways to get around it. One of the ways he lists is using magic not-a-number values. I think Pandas can deal with that correctly and not average those in.
00:00 I think the thing that’s really nice about it is I can be given a CSV file or some sort of data thing, set of data. I can work my way through it and maybe find the bad data and fill it in potentially, but his fixes are like, ‘You run this one line in Pandas and magic happens.’ And it’s better, right? The fixes are so much better than the fixes that I would come up with.
00:00 Yeah, and I do like that he’s talking about different ways to deal with it with NumPy, even without Pandas also, because you might not be using Pandas. One of the ways you would do it with any math package really, would be to… I guess I don’t know how to do that. Never mind.
00:00 You somehow have to find all of the values anyway, and fill them in. One of the ways is if you’re calculating an average, calculate the average of everything else and fill in the blanks with the average number.
00:00 Right, I guess it depends on what you’re going to do. Are you going to average it? Are you going to max and min it? You could push that through, right?
00:00 The best solution I think is using the not-a-number and letting the libraries take care of it for you. I wanted to bring this up, partly because anybody that’s working with data collection and doing math with that has to deal with the fact that sometimes there’s not numbers there and you have to deal with it.
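The two ideas from this segment, marking a placeholder value as not-a-number and then filling the blanks with the average of everything else, can be sketched with only the standard library. This is an illustration under assumptions: the readings and the choice of 0.0 as the placeholder are invented, and the helper names are hypothetical. Pandas offers the same operations more compactly; for instance, its `mean()` skips NaN by default and `fillna()` fills in the gaps.

```python
import math


def mark_missing(values, sentinel):
    """Replace a known placeholder value with NaN so it can't skew the math."""
    return [math.nan if value == sentinel else value for value in values]


def mean_ignoring_missing(values):
    """Average only the values that are actually present."""
    present = [value for value in values if not math.isnan(value)]
    return sum(present) / len(present)


def fill_missing_with_mean(values):
    """Fill each NaN with the mean of the values that are present."""
    mean = mean_ignoring_missing(values)
    return [mean if math.isnan(value) else value for value in values]


# Hypothetical measurements where the collector wrote 0.0 for "no data".
raw = [2.0, 0.0, 4.0, 6.0, 0.0]

naive_mean = sum(raw) / len(raw)            # placeholders drag the mean down
readings = mark_missing(raw, sentinel=0.0)
honest_mean = mean_ignoring_missing(readings)
filled = fill_missing_with_mean(readings)

print(naive_mean, honest_mean, filled)
```

Here the naive mean comes out as 2.4, while the mean over the three real readings is 4.0, and the filled list becomes [2.0, 4.0, 4.0, 6.0, 4.0].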
00:00 Okay. Awesome. He’s from machinelearningmastery.com, I think. He’s got just a ton of cool stuff going on over there, it’s not just this one article. If you’re into these kinds of things, definitely check it out.
00:00 So, what’s up next is the Hug REST framework. But before we get to that, I want to give Rollbar a hug.
00:00 Rollbar is awesome. As people know, I’ve been using them for a long time on the websites and the websites are getting more and more traffic. I recently – I’m not sure whether it was a wise decision or not because I’m really busy with other stuff – I just got really frustrated with the way my servers are working, the way I could move them around and performance and stuff. So, one day I woke up and said, ‘That’s it. I’m converting it all to MongoDB.’ That was last week. That took 3 days of rewriting all my sites to Mongo. I really think Mongo is the right choice and I just love the way it’s working right now, but that was a pretty serious take the guts out of all my web apps and stick in a new set of guts that are similar but not entirely compatible. I spent a little time with Rollbar and they helped me out. Found a few problems where maybe types used to be strings and I could compare them where one was no longer a string. They didn’t compare the same so I got weird errors. Rollbar made it super easy to track that down.
00:00 If you want to have reliability and most importantly, awareness of the state of your apps, plug in Rollbar to your web apps. You can use it in Pyramid, Flask, Django, whatever. Just plug it in and you’ll get notifications right away.
00:00 Be sure to visit Rollbar.com/pythonbytes and you’ll get a special offer to get started there.
00:00 I bet that you definitely noticed those messages, but I didn’t even notice you were mucking with these things and I’m pretty sure that nobody else did; very few people did either.
00:00 Thank you for saying that, but I actually know how many people ran into problems. There was a couple. I got an email from a couple people saying, ‘Hey, I had this problem with your app.’ I’m like, ‘I know but I didn’t know your email address. I know what your problem was and it’s already fixed.’ I just couldn’t contact them because they hadn’t actually created an account yet. It was really nice to be able to say, ‘The problem you’re telling me about is already fixed. I couldn’t communicate that back to you. Really sorry about that.’
00:00 You seem like a big team then, because of that.
00:00 Yeah, definitely. All the folks here in the cubicle farm were busy. (Laughs)
00:00 One of the next things that I want to do is build some nice APIs. I think it’s really an interesting time for the web and Python. There’s a lot of flowers blooming, if you will. We’ve got Pyramid, Django, Flask; those guys are all doing super stuff. Most of my stuff is Pyramid. We’ve got Japronto coming along, Sanic, and another one that I just learned about is called Hug. Hug.Rest. How’s that for a name and a domain?
00:00 Hug is a Python web framework, just specifically for building restful documented, documentable, versionable APIs. It’s built both for super simplicity and flexibility, as well as performance. I started looking and thought, ‘Wow. This is quite interesting.’
00:00 The idea is you can create an API once and you can consume it in all these different ways. You can import it as a module or a package into your project and use the API that way. You can communicate, obviously, over HTTP as a restful API. Or it also has a command line interface way to expose that. If you write some kind of web app or functionality you want to expose over an API and you also want to call it locally, it’s the same code.
00:00 It’s also written in Python 3. It uses Cython all over the place so it’s super-fast; it’s one of the fastest frameworks out there for these kinds of things. At least the non-Async version, let’s say.
00:00 It’s got a decorator model so the code looks really clean.
00:00 Yeah, and the decorator model is cool because the decorator model will do version-management. You can have Version 1 and Version 2 of the API and have different data formats and it can just co-exist. You get automatic documentation based on that. It will do type annotations and then use the type annotations as part of the documentation and things like that. It’s a pretty cool, simple, little framework. So, hug for those guys. Nice job.
00:00 Speaking of CLIs, I had an example that I’m running with the pytest book that I’m working on. For the front end of it, I was punting before and not putting a front end on the application, but I wanted to at least put a command line interface in. My first attempt was to go down the argparse route, and with the particular quirks of this application I needed subcommands, and the tutorials I found were out of date and didn’t work. I was having a little bit of difficulty so I went ahead and tried Click. I’d heard of Click before and hadn’t tried it. A tutorial from, like, 3 years ago was about what I needed and it worked right away. Half a page of code and my command line interface is done.
00:00 That’s really cool. That’s also decorator-heavy, right?
00:00 In my Sublime editor, it’s colored nicely and my wife walked by and said, ‘That’s such beautiful code.’
00:00 Lovely. Take that on many levels, that’s awesome.
00:00 It’s by Armin Ronacher, the guy from Flask.
00:00 Oh, did he do Click?
00:00 I think so. Click is cool. I’ve done a little bit of work with it and I’ve liked what I’ve seen.
00:00 I also want to try adding a different CLI interface to it as well.
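The subcommand layout mentioned in this segment can be sketched with the standard library's argparse, so it runs without extra installs. This is only an illustration: the `tasks` program name, the `add` and `list` subcommands, and the `--owner` option are hypothetical and are not taken from the book's example. Click expresses the same structure with decorators, using a `@click.group()` function and commands registered on that group.

```python
import argparse


def build_parser():
    """Build a parser with two subcommands: 'add' and 'list'."""
    parser = argparse.ArgumentParser(prog="tasks")
    # dest records which subcommand was chosen; required=True rejects a bare call.
    subparsers = parser.add_subparsers(dest="command", required=True)

    add = subparsers.add_parser("add", help="add a task")
    add.add_argument("summary")
    add.add_argument("--owner", default=None)

    subparsers.add_parser("list", help="list tasks")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and dispatch on the subcommand."""
    args = build_parser().parse_args(argv)
    if args.command == "add":
        return f"added {args.summary!r} for {args.owner or 'nobody'}"
    return "listing tasks"


print(main(["add", "write tests", "--owner", "brian"]))
print(main(["list"]))
```

Passing an explicit list to `main` keeps the sketch easy to try from an interpreter; a real script would call `main()` with no arguments so the command line is used.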
00:00 The last one that I chose for us is kind of a refresher, a back-to-the-fundamentals-type thing. “Python Instance, Class, and Static Methods Demystified”. So, this one is on realpython.com. I went over there and checked it out. I thought, ‘Realpython.com, that’s cool.’ I didn’t realize this is from Dan Bader. We seem to be covering a lot of Dan’s stuff over here and I have more to say about Dan later.
00:00 One of the things that I think are not obvious when you’re first getting started is like, instance methods, those are pretty straightforward. You call them on instances, as in all other languages. But the fact that I can call static methods or class methods on instances, that’s a little bit funky. That seems a little bit weird.
00:00 The other one, the main one I think, is why are there 2 things like static method and class method? They seem the same. Why are there 2? When would I use one versus the other? The class method takes a cls parameter, which is literally the type that it’s on, and the static method just doesn’t. But other than that, they seem the same, right? So, if you’re going to, say, interact with the class, use the class method. If you’re going to create an instance of the class, you can use the cls parameter to support inheritance and stuff.
00:00 If I’ve got, let’s say, a vehicle class and a Tesla car class, that class method could say, ‘Allocate a cls,’ whatever that is. If you called it on Tesla, the class method would actually get that type, the type that it knows it is, whereas the static method is just a grouping. I thought that was interesting.
00:00 Does the class method follow, then, the hierarchy? If I declare a base method on a base class, is it available to a subclass?
00:00 Yes, always. And it’s always true for static methods, but the difference is the static method doesn’t really know what type it’s being called on, whereas the class method is given the type. So, if you call it farther down on the inheritance chain, whatever level you’re at, that type is communicated to it so you’re told where you are in the hierarchy in the class method. Whereas in static, it’s just a method.
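The difference described here is easy to see in a few lines. The sketch below follows the vehicle and Tesla example from the conversation; the method names `with_four_wheels` and `miles_to_km` are invented for illustration.

```python
class Vehicle:
    def __init__(self, wheels):
        self.wheels = wheels

    @classmethod
    def with_four_wheels(cls):
        # cls is whatever class the method was called on, so subclasses
        # get instances of themselves without overriding anything.
        return cls(4)

    @staticmethod
    def miles_to_km(miles):
        # No self and no cls: just a function grouped under the class.
        return miles * 1.609344


class Tesla(Vehicle):
    pass


car = Tesla.with_four_wheels()
print(type(car).__name__)                          # class method saw cls=Tesla
print(type(Vehicle.with_four_wheels()).__name__)   # same method, cls=Vehicle
print(type(car.with_four_wheels()).__name__)       # callable on an instance too
print(car.miles_to_km(10))                         # static method via an instance
```

Both methods are inherited by `Tesla`, but only the class method is told which class it was called on, which is why it works as an alternative constructor across the hierarchy.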
00:00 I don’t think I’ve ever used static methods.
00:00 They’re out there hanging out with their friend, class methods.
00:00 I have a quick follow-up from the last show. David Bieber from Google, the guy who works on Python Fire sent us a note.
00:00 I said something to the effect of, ‘Look, Python Fire is awesome but IPython is a serious dependency to take if I just want a CLI.’ I think that’s fair. But he said, ‘Hey, one of our primary plans is to remove IPython as a dependency, we’re just not there yet.’
00:00 So, if anybody in the audience wants to help those guys move forward, they’re totally working on that. Python Fire from Google is definitely getting some interesting thinning out and it will be very nice.
00:00 I like to hear that, that they’re working on eventually getting rid of that dependency. That’s pretty cool. It’s something I had mentioned when we talked about Python Fire. Your development time is important, too and putting an interface together with that is pretty fast. So, keep that in mind.
00:00 It’s not always about optimizing for the machines.
00:00 One more follow up is, we did cover pdir2 or pdir a couple of episodes ago, with the dir colors it prints out. One of the complaints I had was that it didn’t look that great on my black terminal.
00:00 I had the same problem. I like darker stuff and I’m like, ‘Wait, where’s all the words?’
00:00 They just updated yesterday (April 4, 2017), I think, and it does have color configuration now. You can drop a .pdir2config file in your home directory. I set my background color to magenta so it was visible for docs and on both black and white and now it looks great.
00:00 Pdir2 now has themes. Love it.
00:00 How’s the book coming? I heard there’s been a spotting.
00:00 Yeah, so on Twitter the other day a guy named Jacob Jarosz noticed that it was listed on the Pragmatic Publishers website, so it’s out there.
00:00 I love the cover. The rocket is cool.
00:00 Yeah. A ‘50s sci-fi nerd, so…
00:00 It’s perfect.
00:00 How about you?
00:00 It has been a super busy couple of weeks. I’ve been working on a couple of classes, one of them I’m about to release. By the time this recording comes out, it will be out, so tomorrow basically. A course called, “Using and Mastering Cookiecutter”. A really deep dive into what is Cookiecutter and how you create and manage projects with Cookiecutter. I think it’s going to be a really fun course.
00:00 Also, just a few hours ago, I launched “Managing Python Dependencies with pip and Virtual Environments”, a class that Dan Bader came over and wrote for us, and we’re shipping that as well. I took that course and I actually learned quite a bit from it. It’s not just like, ‘Pip install. Done.’ It’s, ‘What is the process you use to manage your dependencies?’ ‘What is the thinking and workflow you use to evaluate if a package is worth taking a dependency on?’ And all sorts of cool stuff like that. A bunch of best practices.
00:00 I launched both of those and I started selling course bundles on Talk Python training as well. They sort of go along with those. Lots of stuff.
00:00 That’s pretty exciting. I’ve got to check out that Cookiecutter thing.
00:00 It will be out tomorrow morning. For everyone listening, that’s today. But for you Brian, that’s tomorrow morning. The magic of time travel.
00:00 Thank you so much for finding all these great items. That was fun as always, Brian.
00:00 It was fun for me, too. Thanks to everybody for all your feedback that you sent.
00:00 Thanks, everyone and thanks, Rollbar for supporting the show.
00:00 Thank you for listening to Python Bytes. Follow the show on Twitter via @pythonbytes and get the full show notes at pythonbytes.fm. If you have a news item you want featured, just visit pythonbytes.fm and send it our way. We’re always on the lookout for sharing something cool. On behalf of myself and Brian Okken, this is Michael Kennedy. Thank you for listening and sharing this podcast with your friends and colleagues.