

Finding Your Location with Geospatial Queries

05.30.2012

Geospatial Indexing

Geospatial indexes are indexes on points. Kristina Chodorow has written an excellent description of how that works in her blog post "Mongo in Flatland." The geohashing technique described there makes it possible to search for points near a given point. By maintaining a catalog of points of interest (such as coffee shops, restaurants, etc), you can find the one nearest to you by issuing a query for points near where you are.
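For concreteness, here's a minimal sketch (in Java, using the legacy MongoDB driver) of that kind of "what's near me" query. The places collection, its loc field, and the sample coordinates are all made up for illustration; they are not part of this project.

// A made-up "points of interest near me" query against a 2d (geohash) index.
import java.util.Arrays;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class NearbyPlaces {
    public static void main(String[] args) throws Exception {
        DB db = new MongoClient("localhost").getDB("geobox");
        DBCollection places = db.getCollection("places");

        // Index the location field with a geohash-based 2d index.
        places.ensureIndex(new BasicDBObject("loc", "2d"));

        // Documents look like { name: "...", loc: [longitude, latitude] }.
        BasicDBObject query = new BasicDBObject("loc",
                new BasicDBObject("$near", Arrays.asList(-122.4194, 37.7749)));

        // $near returns results ordered by distance, nearest first.
        DBCursor cursor = places.find(query).limit(5);
        while (cursor.hasNext()) {
            System.out.println(cursor.next().get("name"));
        }
    }
}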

However, the capability to search for other points near you doesn't make it easy to directly ask the question "Where am I?"

Suppose you have a polygonal mesh that describes the boundaries of states or countries, and you want to figure out which polygon contains your current location. In other words, given your location, get the name for where you are. Geospatial indexes can't do this directly, because this query isn't about searching for other points. However, there is a solution. I'm going to show you how to do it using a geospatial index. I've built out this example using MongoDB, but the technique should work just as well with other databases that have geospatial indexes, such as Postgres with the PostGIS extension installed.

Implementation Ideas

I had a couple of ideas on how to do this:

  1. Using the boundary polygons, build an inverted index by creating a dictionary that maps from every polygon vertex to its source polygon. Given the query point, search for all the nearby vertices; for each one, retrieve the polygon, and test to see if the query point is within it.
  2. For each boundary polygon, compute its centroid (note that this Wikipedia entry has the formulas we'll need to find the centroid of a polygon; a sketch of those formulas follows this list). Create a dictionary that maps centroids to their source polygons. Given a query point, search for the nearest centroids; for each one, retrieve the polygon, and test to see if the query point is within it.
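Since the centroid formulas come up in option 2, here's a small standalone sketch of them. It's not part of the project code, and, as it turns out below, the GIS library provides getCentroid() so this never needs to be hand-written.

// Centroid of a simple polygon from the standard signed-area formulas.
// Vertices are in order; the last vertex implicitly connects back to the first.
public final class PolygonCentroid {
    /** Returns { cx, cy } for a simple polygon given as parallel x/y arrays. */
    public static double[] centroid(double[] x, double[] y) {
        double a = 0.0, cx = 0.0, cy = 0.0;
        int n = x.length;
        for (int i = 0; i < n; i++) {
            int j = (i + 1) % n;                    // next vertex, wrapping around
            double cross = x[i] * y[j] - x[j] * y[i];
            a  += cross;                            // accumulates twice the signed area
            cx += (x[i] + x[j]) * cross;
            cy += (y[i] + y[j]) * cross;
        }
        a *= 0.5;                                   // signed area A
        return new double[] { cx / (6.0 * a), cy / (6.0 * a) };
    }

    public static void main(String[] args) {
        // Unit square: centroid should be (0.5, 0.5).
        double[] x = { 0, 1, 1, 0 };
        double[] y = { 0, 0, 1, 1 };
        double[] c = centroid(x, y);
        System.out.printf("(%.2f, %.2f)%n", c[0], c[1]);
    }
}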

I'm not sure which one of these solutions will work best; I can imagine the first generating a lot of hits because there could be many vertices for some of the polygons if they represent crinkly boundaries with high fidelity. A lot of hits means having to test membership in a lot of polygons, and that will slow things down. But I really won't know until I've got some data to try it out on.

We Need Some Geospatial Data

The first ingredient we need in this recipe is some data to work with. I guessed that the USGS would have some that I could download; a Google search for "USGS state boundaries" found me this page, which had a download link for state_bounds.zip, a zipped ESRI shapefile.

What the heck is an "ESRI shapefile"? Wikipedia has an entry that describes the shapefile format. Great! I could write a parser for the file I downloaded above. It was beginning to look like this might be a popular data format, though; could I avoid having to write a parser for it myself?

Sure enough, a Google search for "esri shapefile parser java" turned up a question on Stack Overflow asking exactly that, with answers pointing at the GeoTools library as a parser for these files.

Taking a Look at the Data

I got started by writing a simple application to load the data from the shapefile and report some statistics on it. This would give me some experience with the GeoTools GIS libraries, and I could use the results to help choose one of the implementation strategies described above.

Loading the shapefile turned out not to be so easy. The GeoTools library is a bit opaque. While there's lots of javadoc, it's not clear how to use a lot of the library. To get anything out of it, I found myself introspecting the data structures in the debugger to see what they contained.

My initial attempt to use the shapefile above wasn't yielding any interesting-looking features in the code. So I tried some more searches, looking for examples of using shapefiles. I finally found "Data Reading", which offered some examples of schema introspection on a shapefile. That example also referred to org/geotools/sampleData/statepop.shp, which looked suspiciously like a file path from the samples in the GeoTools download. Sure enough, that file is there.

This example was a lucky find, because it turns out the original shapefile I was working with has some deficiencies. My secondary searches had also found some pages highlighting the difficulties of using state_bounds.shp (gif representation): it doesn't include the states' ocean boundaries; apparently you're supposed to overlay another map on top of this one to get those.

So I switched to using the sample data.

In my first pass of DataLoader.java, I wrote the printSchema() function to examine the schema of the data available in the shapefile. Then I added the iterator at the bottom to go through the features and ask for the useful attributes I had identified, and print them out.

The GeoTools calls were all pulled from the two examples above, augmented by using Eclipse's auto-complete and the javadoc to find interesting-looking calls to get what I needed. It turned out that the MultiPolygon class has its own getCentroid() method, so I didn't have to resort to implementing that myself.
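For readers who haven't used GeoTools, here's a rough sketch of the kind of schema introspection and feature iteration described above. It is not the actual DataLoader.java; the STATE_ABBR attribute name is a guess at what the sample data calls its state abbreviation, and depending on your GeoTools version the JTS geometry classes live under com.vividsolutions.jts or org.locationtech.jts.

// Open a shapefile, print its schema, then walk the features and report each
// geometry's vertex count and centroid.
import java.io.File;

import org.geotools.data.FileDataStore;
import org.geotools.data.FileDataStoreFinder;
import org.geotools.data.simple.SimpleFeatureCollection;
import org.geotools.data.simple.SimpleFeatureIterator;
import org.geotools.data.simple.SimpleFeatureSource;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.simple.SimpleFeatureType;
import org.opengis.feature.type.AttributeDescriptor;

import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.Point;

public class ShapefilePeek {
    public static void main(String[] args) throws Exception {
        FileDataStore store = FileDataStoreFinder.getDataStore(new File(args[0]));
        SimpleFeatureSource source = store.getFeatureSource();

        // Schema introspection: list the attributes each feature carries.
        SimpleFeatureType schema = source.getSchema();
        for (AttributeDescriptor d : schema.getAttributeDescriptors()) {
            System.out.println(d.getName() + " : " + d.getType().getBinding().getSimpleName());
        }

        // Walk the features and report vertex counts and centroids.
        SimpleFeatureCollection features = source.getFeatures();
        SimpleFeatureIterator it = features.features();
        try {
            while (it.hasNext()) {
                SimpleFeature feature = it.next();
                Geometry boundary = (Geometry) feature.getDefaultGeometry();
                Point centroid = boundary.getCentroid();   // no need to hand-roll the formula
                System.out.printf("%s: %d points, centroid (%f, %f)%n",
                        feature.getAttribute("STATE_ABBR"),
                        boundary.getNumPoints(),
                        centroid.getX(), centroid.getY());
            }
        } finally {
            it.close();
        }
        store.dispose();
    }
}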

Loading the Data

Printing out the number of points per polygon revealed that some of the polygons had a couple hundred points or more, so I decided to go with the centroid implementation strategy.

In order to do the query I want, I needed to get the data into my database. I'm not going to cover the basics of setting up a MongoDB instance here, but if you're new to MongoDB, start here.

I'm going to be using MongoDB's Geospatial Indexing. I can use the geoNear command to ask for all the points near a given point, and they will be returned in order of distance, nearest first. I've set up my schema and index (see the comments in the code) so I can look for the centroids nearest my query point.

For this pass over DataLoader, I created the MongoGeo class, with the createGeoIndex() and insertRegion() methods. I added code to DataLoader to create data objects from the items found by the feature iterator, and to use MongoGeo.insertRegion() to add them to the database.
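As a sketch of what those two methods might look like (this is not the article's actual MongoGeo class), the index creation and insert could be as simple as the following. The geobox database, bounds collection, and stateAbbr field match the shell session below; the centroid and polygon field names are my assumptions.

import java.util.Arrays;
import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class MongoGeo {
    private final DBCollection bounds;

    public MongoGeo(MongoClient client) {
        DB db = client.getDB("geobox");
        this.bounds = db.getCollection("bounds");
    }

    /** A 2d (geohash) index on the centroid field enables geoNear/$near queries. */
    public void createGeoIndex() {
        bounds.ensureIndex(new BasicDBObject("centroid", "2d"));
    }

    /** Stores one region: its name, its centroid, and its boundary vertices. */
    public void insertRegion(String stateAbbr, double centroidLon, double centroidLat,
                             List<List<Double>> polygon) {
        BasicDBObject doc = new BasicDBObject("stateAbbr", stateAbbr)
                .append("centroid", Arrays.asList(centroidLon, centroidLat))
                .append("polygon", polygon);
        bounds.insert(doc);
    }
}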

Once I had that working, I ran it to populate the database.

A quick look at the data revealed that it only includes the contiguous states:

$ ./mongo
MongoDB shell version: 2.1.2-pre-
connecting to: test
> use geobox
switched to db geobox
> show collections
bounds
system.indexes
> db.bounds.count();
49
> db.bounds.find({}, {_id:0, stateAbbr:1}).sort({stateAbbr:1});
{ "stateAbbr" : "AL" }
{ "stateAbbr" : "AR" }
{ "stateAbbr" : "AZ" }
{ "stateAbbr" : "CA" }
{ "stateAbbr" : "CO" }
{ "stateAbbr" : "CT" }
{ "stateAbbr" : "DC" }
{ "stateAbbr" : "DE" }
{ "stateAbbr" : "FL" }
{ "stateAbbr" : "GA" }
{ "stateAbbr" : "IA" }
{ "stateAbbr" : "ID" }
{ "stateAbbr" : "IL" }
{ "stateAbbr" : "IN" }
{ "stateAbbr" : "KS" }
{ "stateAbbr" : "KY" }
{ "stateAbbr" : "LA" }
{ "stateAbbr" : "MA" }
{ "stateAbbr" : "MD" }
{ "stateAbbr" : "ME" }
Fetched 20 record(s) in 36ms
Type "it" for more
>

The count of 49 led me to suspect there was no data for Alaska (AK), and the listing above confirms it's missing. It doesn't matter for this experiment.

Querying the Data

Now that I've got a database of polygons and their centroids, I need to be able to query it. I added the findContaining() and pointInPoly() methods to the MongoGeo class. Then I wrote the simple command-line test application, LocationFinder.java, in order to test out the queries.
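Here's a rough sketch of how findContaining() and pointInPoly() could be put together: ask the geoNear command for the regions with the nearest centroids, then run a standard ray-casting containment test against each candidate's stored polygon. This is not the article's actual code, and it reuses the assumed centroid and polygon field names from the loader sketch above.

import java.util.Arrays;
import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import com.mongodb.DB;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class RegionFinder {
    private final DB db;

    public RegionFinder(MongoClient client) {
        this.db = client.getDB("geobox");
    }

    /** Returns the region whose polygon contains (lon, lat), or null if none does. */
    public DBObject findContaining(double lon, double lat) {
        // Ask for the regions whose centroids are nearest the query point.
        BasicDBObject cmd = new BasicDBObject("geoNear", "bounds")
                .append("near", Arrays.asList(lon, lat))
                .append("num", 10);
        CommandResult result = db.command(cmd);

        // The command returns { results: [ { dis: ..., obj: <document> }, ... ] }.
        List<?> hits = (List<?>) result.get("results");
        for (Object o : hits) {
            DBObject region = (DBObject) ((DBObject) o).get("obj");
            @SuppressWarnings("unchecked")
            List<List<Double>> polygon = (List<List<Double>>) region.get("polygon");
            if (pointInPoly(lon, lat, polygon)) {
                return region;
            }
        }
        return null;   // e.g. a point out in the ocean
    }

    /** Classic ray-casting test: toggle on each edge crossing to the right of the point. */
    static boolean pointInPoly(double x, double y, List<List<Double>> poly) {
        boolean inside = false;
        for (int i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
            double xi = poly.get(i).get(0), yi = poly.get(i).get(1);
            double xj = poly.get(j).get(0), yj = poly.get(j).get(1);
            if ((yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }
}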

LocationFinder uses the testLocation() utility method to look for a single object. I added a bunch of random calls to testLocation() from main(). To generate the test calls, I pulled up Google Maps, right-clicked at a random spot on the map, and chose "What's here?" The latitude and longitude of the point appeared in a pop-up box, and I copied them into a test call. Note that the code is written to accept (longitude, latitude), matching the classic Cartesian (x, y) coordinate order, so the coordinates must be swapped from Google Maps' representation.
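The calls that produced the results below would look something like this; testLocation() and its (longitude, latitude) signature are my guess at the shape of the test harness.

// Hypothetical calls inside LocationFinder's main(); note the (longitude, latitude) order.
public static void main(String[] args) {
    testLocation(-114.433594, 43.707594);   // a spot in Idaho
    testLocation(-83.583984, 40.713956);    // a spot in Ohio
    testLocation(-92.548828, 31.353637);    // a spot in Louisiana
    testLocation(-90.087891, 25.482951);    // offshore; expected not to match any state
}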

I tried a few test points within states, as well as one outside, and these produced the expected results:

(-114.433594, 43.707594) is in Idaho, USA
(-83.583984, 40.713956) is in Ohio, USA
(-92.548828, 31.353637) is in Louisiana, USA
(-90.087891, 25.482951) not found

See the comments in the code for links to the locations I tested.

Final Commentary

When I first found the references to shapefiles and GeoTools, I thought this would be a good learning opportunity. But I don't feel like I've learnt very much about that. Based on the two shapefiles I tried to work with, it seems like shapefiles might be just a flat-file database format with some associated libraries for reading them. Is that really all there is to it?

The number of points in this data seems small enough that we should wonder if it's worth bothering with a database at all. It would be possible to simply load the data into memory at application start time, and use it directly. It looks like GeoTools might even have a means to build the geospatial index in memory to support that (but if not, I could do that myself).
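For example, GeoTools ships with JTS, whose STRtree could serve as that in-memory spatial index. A minimal sketch (my own, not part of the project; the JTS package is com.vividsolutions.jts in older releases, org.locationtech.jts in newer ones) might look like this:

import java.util.List;

import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Envelope;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.index.strtree.STRtree;

public class InMemoryRegionIndex {
    private final STRtree tree = new STRtree();
    private final GeometryFactory factory = new GeometryFactory();

    /** Call once per boundary polygon while loading the shapefile. */
    public void add(String name, Geometry boundary) {
        tree.insert(boundary.getEnvelopeInternal(), new Object[] { name, boundary });
    }

    /** Returns the name of the region containing (lon, lat), or null. */
    public String find(double lon, double lat) {
        Coordinate c = new Coordinate(lon, lat);
        Geometry point = factory.createPoint(c);
        // The tree returns every region whose bounding box covers the point;
        // confirm true containment before reporting a match.
        @SuppressWarnings("unchecked")
        List<Object[]> candidates = tree.query(new Envelope(c));
        for (Object[] candidate : candidates) {
            if (((Geometry) candidate[1]).contains(point)) {
                return (String) candidate[0];
            }
        }
        return null;
    }
}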

While it's true that there's only a small amount of data here, I built this with something bigger in mind. Even adding countries might not increase the size of the dataset by much, but imagine if we were to add city boundaries. (We might need to look into increasing the number of bits of resolution for the geohashing to keep this accurate; see the documentation for details on that.) Cities could make this quite large, and then it would make more sense to build the index this way. The database also provides a convenient common place to regularize and keep information that may come from many sources in different formats. We could augment the information for regions over time. Finding and loading more geolocation data could be a long-term project.
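As a concrete example of that resolution tweak, the 2d index accepts a bits option (default 26, maximum 32) at creation time; using the same legacy-driver sketch and assumed field names as above:

// Raise the geohash precision of the 2d index when creating it.
bounds.ensureIndex(new BasicDBObject("centroid", "2d"),
                   new BasicDBObject("bits", 32));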

Published at DZone with permission of Chris Westin, author and DZone MVB. (source)
