How do I turn my existing data into GeoSpatial data? Or "To Geocode or not to Geocode?"
Maps, Spatial, and Geography are amazing.
I love how it all comes together: database, software, and visualization, giving answers you couldn’t get before.
The base for Spatial technology is spatial data.
It can be external GIS data used for analysis and enrichment, such as demographic data. But what about our internal organizational data? It is usually not spatially aware (i.e. it has no coordinates or geometry).
What if we want to start analyzing our internal data, such as customers, suppliers, and employees, with spatial power directly in the database?
How can we make it Spatial?!
Usually, your data contains location descriptions such as addresses, business names, zip codes, parcel numbers, city names, or demographic area codes, which you can use to find the real-world coordinates in a process called “Geocoding”.
In this blog post, I will explain what this “Geocoding” process is, and why it can be complex and intimidating for non-geospatial professionals, and even for the bravest geospatial professionals.
The “Geocoding” process, in general, uses a text-matching engine (often with fuzzy-matching techniques, such as Soundex) that translates those human-readable location descriptions into coordinates.
While it’s not always an easy process, it’s doable.
There are several ways to do it yourself. The most common are:
Running a commercial geocoding API service in your favorite programming language, from your servers or from the cloud (if your data is not behind a firewall and it’s not a huge amount).
Using online commercial SaaS tools to which you can upload a file with your data.
Using out-of-the-box GIS or geocoding desktop software with an internal geocoding engine.
Geocoding engines can be online services or locally installed, and the most fundamental requirement for a good geocoder is its base reference data. If the reference data is good, up to date, and rich in synonyms for alternative names, the geocoder has a much better chance of succeeding. If it does not include the street and house you are looking for, even the best text-processing algorithms won’t help.
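To see why reference data matters more than the matching algorithm, here is a toy sketch: simple fuzzy matching can rescue a misspelled street name, but only if the correct name exists in the reference list. The street names and the 0.6 cutoff are made-up assumptions for illustration.

```python
# Toy illustration: fuzzy matching only helps when the reference data
# actually contains the street you are looking for.
import difflib

reference_streets = ["Main Street", "Maple Avenue", "Market Street"]

def match_street(raw_name, reference, cutoff=0.6):
    """Return the closest reference street name, or None if nothing is close."""
    matches = difflib.get_close_matches(raw_name, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_street("Main Stret", reference_streets))  # typo still resolves to "Main Street"
print(match_street("Elm Road", reference_streets))    # absent from the reference -> None
```

The first lookup succeeds despite the typo; the second fails not because the algorithm is weak, but because "Elm Road" simply isn’t in the reference data.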
So, if you want to geocode a large dataset, you probably won’t start by gathering all the reference data and developing this text engine yourself.
You will probably want to use a Geocoding service: it does everything for you, and you don’t have to worry about reference data or complex algorithms.
But if you have a large database to geocode, this approach has some drawbacks:
Performance: running a batch of millions of requests against an API is not the fastest way, and it can also be unstable due to connection errors, depending on where everything is located.
Privacy: you are sharing your data (even if it’s just a row ID and an address) with a third party.
License limitations: some Geocoding providers limit the storage of the returned data; read the terms carefully.
Costs: millions of records can cost a lot, especially if you run the process again and again because you are not satisfied with the results.
What about doing all offline with dedicated Geocode software?
For security-sensitive use cases, for data you cannot share, or for systems behind firewalls, you can do it all in-house, with reference geodata and software.
The ingredients are:
Dedicated GIS software with a good Geocoding engine (usually commercial), or an open-source geocoding solution you implement yourself.
Commercial address reference data that you purchase, or open-source address data that you find, download, and prepare, if it is good enough for the area you are geocoding.
A professional GIS/spatial engineer to load the data, configure the service, add synonym data, and run and rerun the geocoding process until you are satisfied with the results.
From my experience, after running a lot of geocoding projects for my clients, let me suggest what I call the “hybrid” way to geocode, which fits whether you use a Geocoding API or offline tools:
It combines the database’s power with a Geocoding service/engine:
If the data you want to Geocode is not already in the database, load it.
While I prefer doing it with SQL in the database, Python (with libraries such as Pandas or GeoPy) or other languages are also a good solution, depending on the case.
I recommend examining and cleaning the data before geocoding, looking for common errors; this improves the results of the process and saves money and time:
Check for null/empty values.
Check for “must be” values: city and street, for example. Without them, nothing can be found.
Trim white spaces from the start/end of the texts, as well as unrelated characters such as stray commas.
Find and remove numbers or text where they do not belong (such as digits in the city column, in applicable countries).
Find and clean invalid house numbers, such as zero, negative numbers, or house numbers too large to exist in reality (I use the max/min functions for that).
Run statistical queries on the street names and cities (count with group by, ordered top-down) and look for stand-out errors; you can often select a large batch and fix it in one command. Investing a few minutes here will be worth it later.
Remove or flag the bad rows you found that surely cannot be geocoded, at least in the first phase.
You can also mark the rows that will surely resolve only to a street center or just a city center.
While this may sound tedious, especially if you are paying for a Geocoding service, be sure: those services are not magical. If you feed them garbage, they will give you back errors or inaccurate results, which will make you work harder later, or leave you disappointed and paying much more money.
After cleaning the data, I suggest trying a “pure join” with the reference data, which you either purchased or built from open sources.
Usually, after the cleaning process, it can match 30-70 percent of your data (depending on the quality of both sides), with the bonus that these matches are 100% correct:
Load the data and identify the relevant columns on both data sets (city, street, house, etc.).
Write the join query and count how many rows were matched: city->city, street->street, house->house, etc., casting the data formats if needed.
Check the non-joined rows with a left join and briefly review the most common problems (use group by and count). Where possible, mass-fix large batches of them, and clean or filter “garbage” data that has no chance of being geocoded, such as irrelevant texts or missing house numbers.
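Both the pure join and the left-join diagnostics can be sketched in a few lines. This is a toy SQLite version; the table names (customers, ref_addresses), columns, and coordinates are assumptions for illustration.

```python
# Sketch of the "pure join" against reference data, plus the left-join
# check that groups the unmatched rows by their failing street name.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER, city TEXT, street TEXT, house INTEGER);
CREATE TABLE ref_addresses (city TEXT, street TEXT, house INTEGER, lon REAL, lat REAL);
INSERT INTO customers VALUES
  (1, 'Springfield', 'Main Street', 12),
  (2, 'Springfield', 'Main Stret',  12),  -- typo, will not join
  (3, 'Shelbyville', 'Oak Avenue',   5);
INSERT INTO ref_addresses VALUES
  ('Springfield', 'Main Street', 12, -93.29, 44.98),
  ('Shelbyville', 'Oak Avenue',   5, -93.30, 44.97);
""")

# Exact join: how many rows already resolve to coordinates?
joined = con.execute("""
    SELECT COUNT(*) FROM customers c
    JOIN ref_addresses r
      ON c.city = r.city AND c.street = r.street AND c.house = r.house
""").fetchone()[0]

# Left join: inspect what did NOT match, grouped by the failing street.
unmatched = con.execute("""
    SELECT c.street, COUNT(*) FROM customers c
    LEFT JOIN ref_addresses r
      ON c.city = r.city AND c.street = r.street AND c.house = r.house
    WHERE r.city IS NULL
    GROUP BY c.street ORDER BY COUNT(*) DESC
""").fetchall()

print(joined)     # rows joined exactly
print(unmatched)  # most common unmatched street names, top-down
```

The grouped unmatched list is exactly what you scan for mass-fixable errors before paying a service for the rest.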
Prepare for the “Real” Geocode process:
Create a query or a table from the non-joined rows and check for duplicate addresses (duplicate combinations of city+street+house, or just street+house).
It is common in real databases for the same address to repeat itself (different customers might share an address, or a lot of shops sit at the same mall address).
It’s a very common and natural mistake to geocode the same address again and again, although its coordinates are the same for all. In most of the geocoding projects I have done, aggregating those duplicates lowered the number of rows to geocode dramatically.
You want to geocode each address just once! The way to do this aggregation with SQL magic is to create an “address ID” shared by all records with the same address, so you can rejoin after geocoding only one row per address.
I use a window function with dense_rank() for this; it exists in most databases.
Now you can create your final query/table for the geocoder by grouping on the created address ID together with the texts to geocode.
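The dense_rank() trick looks like this. Again a toy SQLite version (window functions need SQLite 3.25+, which ships with modern Python); the table names keyed and to_geocode are assumptions for illustration.

```python
# One synthetic address_id per distinct address, so each address
# is sent to the geocoder only once.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER, city TEXT, street TEXT, house INTEGER);
INSERT INTO customers VALUES
  (1, 'Springfield', 'Main Street', 12),
  (2, 'Springfield', 'Main Street', 12),  -- same address, different customer
  (3, 'Shelbyville', 'Oak Avenue',   5);
""")

con.executescript("""
-- Every row with the same city+street+house gets the same address_id.
CREATE TABLE keyed AS
SELECT id, city, street, house,
       DENSE_RANK() OVER (ORDER BY city, street, house) AS address_id
FROM customers;

-- One row per address_id: this is the list actually sent to the geocoder.
CREATE TABLE to_geocode AS
SELECT address_id, city, street, house
FROM keyed
GROUP BY address_id, city, street, house;
""")

print(con.execute("SELECT COUNT(*) FROM customers").fetchone()[0])   # original rows
print(con.execute("SELECT COUNT(*) FROM to_geocode").fetchone()[0])  # unique addresses
```

Three customer rows collapse to two addresses here; on real data with many duplicates the reduction, and the cost saving, is much larger.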
Geocode with your favorite service, online or offline. I suggest running a sample first.
While working from the database, some databases, such as PostgreSQL, can run API calls with Python, so you can make the calls directly from the database and write the results back to the table. The same mechanism can also geocode each new record via a trigger. If you can’t do that for security reasons, you can export the data to a CSV file and run the service on it.
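When the geocoder can’t be reached from inside the database, the CSV export route works with virtually any service. A minimal sketch, assuming the deduplicated to_geocode rows from the previous step and illustrative column names:

```python
# Export the deduplicated addresses to CSV for an external geocoding service.
import csv
import io

to_geocode = [
    (1, "Shelbyville", "Oak Avenue", 5),
    (2, "Springfield", "Main Street", 12),
]

# io.StringIO keeps the sketch self-contained; in practice use
# open("to_geocode.csv", "w", newline="") instead.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["address_id", "city", "street", "house"])  # header the service maps from
writer.writerows(to_geocode)

print(buf.getvalue().splitlines()[0])  # header row
```

Keeping the address_id as the first column is what makes the rejoin after geocoding trivial.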
After getting the geocoding results, examine the metadata, such as the accuracy and the score, and review the lower scores. Try to find out why, and whether there is a way to bulk-fix them.
If the service’s results are hard to explain, you can easily run the same addresses through another service and compare. A library such as GeoPy, used from within or outside the database, makes it fairly easy to run and compare.
Rejoin the results to your data with the address ID. I also recommend keeping some metadata from the geocoding results, such as the accuracy and the score, for future use.
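The rejoin is a plain join on the address ID, carrying the coordinates and the service’s score back to every original row. A toy SQLite sketch, with table names, coordinates, and scores all made up for illustration:

```python
# Rejoin geocoding results to the full dataset by address_id,
# keeping the service's score metadata alongside the coordinates.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE keyed (id INTEGER, address_id INTEGER);
CREATE TABLE geocoded (address_id INTEGER, lon REAL, lat REAL, score REAL);
-- Two customers share address 1; address 2 belongs to one customer.
INSERT INTO keyed VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO geocoded VALUES (1, -93.29, 44.98, 0.95), (2, -93.30, 44.97, 0.60);
""")

rows = con.execute("""
    SELECT k.id, g.lon, g.lat, g.score
    FROM keyed k
    JOIN geocoded g ON k.address_id = g.address_id
    ORDER BY k.id
""").fetchall()

print(len(rows))  # every original row now carries coordinates and a score
print(rows[0])
```

Because each address was geocoded once, the two rows sharing address_id 1 receive identical coordinates, and the stored score lets you filter or review low-confidence rows later.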
You can now decide whether to leave out the rows that failed or scored low, or to run a combined automatic/manual process on the non-geocoded rows. Usually, the leftovers I see are really garbage data or older data that can mostly be skipped, so I suggest keeping the auto/manual process as an option.