I helped out with this project, happy to answer any questions. I was involved from the beginning, but my biggest contribution was on the deep learning side that does the tile-matching. I helped with the initial prototype using DIGITS, Caffe, and a bunch of Python. Then Aman Tiwari moved us to a more accurate 34-layer ResNet trained in TensorFlow, and a more efficient nearest neighbor lookup using a new implementation of CoverTree search.
Since you've opened yourself up to questioning :)... Thanks for sharing; apologies for this very layman-ish question:
I clicked on a swimming pool in New York City... there aren't a ton of them in NYC, but very few of the matches have even a spot of blue in them. I know the algorithm is more than just "look for more blue patches"... if I were to explain this to another layperson, what is the most obvious explanation for results that seem less intuitive than expected?
Not Kyle, but another member of the team involved in the project. When we built the training model, we used OpenStreetMap data to find locations of ~1,000,000 "things". A thousand churches, a thousand water towers, a thousand playgrounds, and so on. For each of those locations, we downloaded a satellite image.
The neural net was then trained to look for the features that make each category distinct: what makes a playground different from a church? It could be patterns, it could be colors, it could be any number of things. (For more precise details, you'll need to talk to Aman or Kyle.) It compares lots of things to lots of things, makes some guesses, and then checks whether those guesses help it correctly determine what we told it was in each tile.
Once the model is trained, it has identified the 1024 "features" that are most significant in correctly distinguishing types of things from each other. We then run every tile of a geographical region through that feature extractor, which converts each tile into a point in a 1024-dimensional space. The search function then takes the tile you clicked, looks up its point, and finds the 100 tiles closest to it within that 1024-dimensional space.
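To make that concrete, here's a minimal brute-force sketch of the lookup step (the real system used a CoverTree index for efficiency, and real feature vectors from the network; the random vectors and the `nearest_tiles` helper here are just illustrative stand-ins):

```python
import numpy as np

# Stand-in for the real data: each tile has already been converted
# into a 1024-dimensional feature vector by the trained network.
rng = np.random.default_rng(0)
num_tiles, dims = 5000, 1024
tile_features = rng.normal(size=(num_tiles, dims)).astype(np.float32)

def nearest_tiles(query_index, k=100):
    """Return indices of the k tiles closest to the query tile
    in the 1024-dimensional feature space (Euclidean distance)."""
    query = tile_features[query_index]
    dists = np.linalg.norm(tile_features - query, axis=1)
    order = np.argsort(dists)
    # The query tile is trivially its own nearest neighbor; skip it.
    return order[1:k + 1]

neighbors = nearest_tiles(42, k=100)
```

The brute-force scan is O(n) per query, which is why a tree-based index like a cover tree matters once you're searching hundreds of thousands of tiles.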
So, tl;dr: it's not looking for colors, it's looking for computable features, which may or may not be color-specific. (Actually, they're highly non-color-specific: the training process randomly "wiggles" the colors to make sure the model doesn't get too tied to a very precise color.)
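That color "wiggle" is a standard augmentation trick. A simple sketch of the idea (the `jitter_color` function and `max_shift` parameter are hypothetical, not the project's actual augmentation code):

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_color(image, max_shift=0.1):
    """Randomly shift each color channel so the model can't latch
    onto one precise hue. `image` is an H x W x 3 float array in
    [0, 1]; `max_shift` is the maximum per-channel shift."""
    shift = rng.uniform(-max_shift, max_shift, size=3)
    return np.clip(image + shift, 0.0, 1.0)

tile = rng.uniform(size=(256, 256, 3))
augmented = jitter_color(tile)
```

Because a slightly different "blue" appears every epoch, the net is pushed toward shape and texture cues instead of one exact color value.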
In my experience in the USA, OpenStreetMap data that comes from TIGER (?) imports, like heliports, is close but wrong. Playgrounds and water towers are usually accurate because someone added them by hand. Churches are mixed. Other items are confusing: a hospital is typically mapped as a single point, but hospitals tend to be some of the largest buildings in small towns.
We found the same: using OSM data that only had points was problematic, both because of accuracy and because the point was often the front door, not the centroid of the object.
I believe we limited our model generation to selecting places that had outlines, and computed the centroid of that outline. One of the benefits of our technique is that we didn't need to be comprehensive—we can throw out lots of places and still have enough to be useful for the model.
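For anyone curious what "computed the centroid of that outline" involves: the centroid of a closed polygon can be computed with the shoelace formula. A self-contained sketch (not the project's code; coordinate handling for real OSM ways would also need projection):

```python
def polygon_centroid(points):
    """Centroid of a simple closed polygon via the shoelace formula.
    `points` is a list of (x, y) vertices; the last vertex is assumed
    to connect back to the first."""
    area2 = cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross          # twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)

# A unit square: centroid is (0.5, 0.5).
print(polygon_centroid([(0, 0), (1, 0), (1, 1), (0, 1)]))
```

Unlike a naive average of the vertices, this weights by area, so densely digitized edges don't drag the centroid around.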
Thanks for the explanation...yeah, as I said, you're talking to a layperson :). I just realized that it was possible that the training was done on images where the color scale was changed, so thinking about it in terms of "well blue patches are so easy to see" is obviously wrong.
I'm not a part of this project, but if you look at your tile it's about 2/3 trees (by area), and the results are mostly tiles full of trees. Out of the results from your search, I picked out another tile [0] that I thought was more "swimming pool" with fewer "distractions". The results look like they're more in line with what you were hoping for. You could iterate this way a bit until you find the best prototype for a swimming pool.
everything david (workergnome) said, and also: in the tile you pointed to, the "swimming pool" is the most interesting/recognizable feature, but most of the image is just "trees and asphalt". if you pick another image where the pool takes up more of the frame, you'll find more of what you're looking for: http://nyc.terrapattern.com/?_ga=1.84865689.1830936426.14642...
Good stuff. As far as I can see, the current version searches for tiles similar to the one a user clicks on.
How would I train this with labeled training data and custom images, so that I could do searches for specific things, either by typing what I'm looking for into a search field, or uploading an image of it? Sort of like Google images search
On the front end, it's easy: add a couple of lines to a configuration file, make a couple of images, and it's done.
The difficult part is that the search is not optimized to be RAM-efficient; each city takes between two and ten gigabytes of RAM. You also need to have several hundred thousand tile images, which are commercially available, but not free.
If you've got a hefty server and the images, then it takes about a day of compute to extract the features and create the search index.
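The RAM figure above is roughly what you'd expect from the raw feature vectors alone. A back-of-envelope calculation, with assumed (not stated) numbers for tile count and precision:

```python
# Assumptions: several hundred thousand tiles per city,
# 1024 float32 features per tile (both figures illustrative).
tiles = 500_000
dims = 1024
bytes_per_float = 4  # float32

index_bytes = tiles * dims * bytes_per_float
print(index_bytes / 1e9)  # ~2 GB just for the raw feature vectors
```

Index structures, overhead, and larger cities push that toward the upper end of the two-to-ten-gigabyte range mentioned above.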
The imagery in their demo is coming from Google Maps, but if you look at the copyright you can see what companies it is coming from. The USDA shoots, if I remember correctly, the whole country every 3 years, and that data is public domain. The commercial sites are constantly shooting new stuff, but it depends on how often Google wants to pay for it. With their purchase of Skybox though, they will likely start getting imagery updated on an extremely frequent basis.
This is quite fun, thanks for sharing. I would be interested in trying it out with different tile sizes too. I keep trying to pick objects that are on the corners of tiles.