How do I turn my existing data into GeoSpatial data? Or "To Geocode or not to Geocode?"
Maps, Spatial, and Geography are amazing.
I love how it all comes together: database, software, and visualization, giving answers you couldn’t get before.
The base for Spatial technology is spatial data.
It can be external GIS data used for analysis and enrichment, such as demographic data. But what about our internal organizational data? It is usually not spatially aware (i.e. it has no coordinates or geometry).
What if we want to start analyzing our internal data, such as customers, suppliers, and employees, with spatial power directly in the database?
How can we make it Spatial?!
Usually, your data contains location descriptions such as addresses, business names, zip codes, parcel numbers, city names, or demographic area codes, which you can use to find the real-world coordinates in a process called “Geocoding”.
In this blog post, I will explain what this “Geocoding” process is, and why it can be complex and intimidating for non-geospatial professionals, and even for the bravest geospatial professionals.
The “Geocoding” process, in general, uses a text-matching engine (often with fuzzy-matching techniques, such as Soundex) that translates those human-readable location descriptions into coordinates.
While it’s not always an easy process, it’s doable.
There are several ways to do it yourself. The most common are:
Running a commercial geocoding API service in your favorite programming language, from your servers or from the cloud (if your data is not behind a firewall and it’s not a huge amount).
Using online commercial SaaS tools to which you can upload a file with your data.
Using out-of-the-box GIS or geocoding desktop software with an internal geocoding engine.
Geocoding engines can be online services or locally installed, and the most fundamental requirement for a good geocoder is its base reference data. If the reference data is good, up to date, and rich in synonyms for alternative names, the geocoder has a much better chance of succeeding. If it does not include the street and house you are looking for, even the best text-processing algorithms won’t help.
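To see why reference data matters more than the matching algorithm, here is a toy sketch: simple fuzzy matching can rescue a misspelled street name, but only if the correct name exists in the reference list. The street names and the 0.6 cutoff are made-up assumptions for illustration.

```python
# Toy illustration: fuzzy matching only helps when the reference data
# actually contains the street you are looking for.
import difflib

reference_streets = ["Main Street", "Maple Avenue", "Market Street"]

def match_street(raw_name, reference, cutoff=0.6):
    """Return the closest reference street name, or None if nothing is close."""
    matches = difflib.get_close_matches(raw_name, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_street("Main Stret", reference_streets))  # typo still resolves to "Main Street"
print(match_street("Elm Road", reference_streets))    # absent from the reference -> None
```

The first lookup succeeds despite the typo; the second fails not because the algorithm is weak, but because "Elm Road" simply isn’t in the reference data.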
So, if you want to geocode a large dataset, you probably won’t start by gathering all the reference data and developing this text engine yourself.
You will probably want to use a Geocoding service: it does everything for you, and you don’t have to worry about reference data or complex algorithms.
But if you have a large database to geocode, this approach has some drawbacks:
Performance: running a batch of millions of requests against an API is not the fastest way, and it can also be unstable due to connection errors, depending on where everything is located.
Privacy: you are sharing your data (even if it’s just a row ID and an address) with a third party.
License limitations: some Geocoding providers limit the storage of the returned data; read the terms carefully.
Costs: millions of records can cost a lot, especially if you run the process again and again because you are not satisfied with the results.
What about doing all offline with dedicated Geocode software?
For security-sensitive use cases, for data you cannot share, or for systems behind firewalls, you can do it all in-house, with reference geodata and software.
The ingredients are:
Dedicated GIS software with a good Geocoding engine (usually commercial), or an open-source geocoding solution you implement yourself.
Commercial address reference data that you purchase, or open-source address data that you find, download, and prepare, if it is good enough for the area you are geocoding.
A professional GIS/spatial engineer to load the data, configure the service, add synonym data, and run and rerun the geocoding process until you are satisfied with the results.
From my experience, after running a lot of geocoding projects for my clients, let me suggest what I call the “hybrid” way to geocode, which fits whether you use a Geocoding API or offline tools:
It combines the database’s power with a Geocoding service/engine:
If the data you want to Geocode is not already in the database, load it.
While I prefer doing it with SQL in the database, Python (with libraries such as Pandas or GeoPy) or other languages are also a good solution, depending on the case.
I recommend examining and cleaning the data before geocoding, looking for common errors; this improves the results of the process and saves money and time:
Check for null/empty values.
Check for “must be” values: city and street, for example. Without them, nothing can be found.
Trim white spaces from the start/end of the texts, as well as unrelated characters such as stray commas.
Find and remove numbers or text where they do not belong (such as digits in the city column, in applicable countries).
Find and clean invalid house numbers, such as zero, negative numbers, or house numbers too large to exist in reality (I use the max/min functions for that).
Run statistical queries on the street names and cities (count with group by, ordered top-down) and look for stand-out errors; you can often select a large batch and fix it in one command. Investing a few minutes here will be worth it later.
Remove or flag the bad rows you found that surely cannot be geocoded, at least in the first phase.
You can also mark the rows that will surely resolve only to a street center or just a city center.
While this may sound tedious, especially if you are paying for a Geocoding service, be sure: those services are not magical. If you feed them garbage, they will give you back errors or inaccurate results, which will make you work harder later, or leave you disappointed and paying much more money.
After cleaning the data, I suggest trying a “pure join” with the reference data, which you either purchased or built from open sources.
Usually, after the cleaning process, it can match 30-70 percent of your data (depending on the quality of both sides), with the bonus that these matches are 100% correct:
Load the data and identify the relevant columns on both data sets (city, street, house, etc.).
Write the join query and count how many rows were matched: city->city, street->street, house->house, etc., casting the data formats if needed.
Check the non-joined rows with a left join and briefly review the most common problems (use group by and count). Where possible, mass-fix large batches of them, and clean or filter “garbage” data that has no chance of being geocoded, such as irrelevant texts or missing house numbers.
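Both the pure join and the left-join diagnostics can be sketched in a few lines. This is a toy SQLite version; the table names (customers, ref_addresses), columns, and coordinates are assumptions for illustration.

```python
# Sketch of the "pure join" against reference data, plus the left-join
# check that groups the unmatched rows by their failing street name.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER, city TEXT, street TEXT, house INTEGER);
CREATE TABLE ref_addresses (city TEXT, street TEXT, house INTEGER, lon REAL, lat REAL);
INSERT INTO customers VALUES
  (1, 'Springfield', 'Main Street', 12),
  (2, 'Springfield', 'Main Stret',  12),  -- typo, will not join
  (3, 'Shelbyville', 'Oak Avenue',   5);
INSERT INTO ref_addresses VALUES
  ('Springfield', 'Main Street', 12, -93.29, 44.98),
  ('Shelbyville', 'Oak Avenue',   5, -93.30, 44.97);
""")

# Exact join: how many rows already resolve to coordinates?
joined = con.execute("""
    SELECT COUNT(*) FROM customers c
    JOIN ref_addresses r
      ON c.city = r.city AND c.street = r.street AND c.house = r.house
""").fetchone()[0]

# Left join: inspect what did NOT match, grouped by the failing street.
unmatched = con.execute("""
    SELECT c.street, COUNT(*) FROM customers c
    LEFT JOIN ref_addresses r
      ON c.city = r.city AND c.street = r.street AND c.house = r.house
    WHERE r.city IS NULL
    GROUP BY c.street ORDER BY COUNT(*) DESC
""").fetchall()

print(joined)     # rows joined exactly
print(unmatched)  # most common unmatched street names, top-down
```

The grouped unmatched list is exactly what you scan for mass-fixable errors before paying a service for the rest.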
Prepare for the “Real” Geocode process:
Create a query or a table from the non-joined rows and check for duplicate addresses (duplicate combinations of city+street+house, or just street+house).
It is common in real databases for the same address to repeat itself (different customers might share an address, or a lot of shops sit at the same mall address).
It’s a very common and natural mistake to geocode the same address again and again, although its coordinates are the same for all. In most of the geocoding projects I have done, aggregating those duplicates lowered the number of rows to geocode dramatically.
You want to geocode each address just once! The way to do this aggregation with SQL magic is to create an “address ID” shared by all records with the same address, so you can rejoin after geocoding only one row per address.
I use a window function with dense_rank() for this; it exists in most databases.
Now you can create your final query/table for the geocoder by grouping on the created address ID together with the texts to geocode.
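The dense_rank() trick looks like this. Again a toy SQLite version (window functions need SQLite 3.25+, which ships with modern Python); the table names keyed and to_geocode are assumptions for illustration.

```python
# One synthetic address_id per distinct address, so each address
# is sent to the geocoder only once.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER, city TEXT, street TEXT, house INTEGER);
INSERT INTO customers VALUES
  (1, 'Springfield', 'Main Street', 12),
  (2, 'Springfield', 'Main Street', 12),  -- same address, different customer
  (3, 'Shelbyville', 'Oak Avenue',   5);
""")

con.executescript("""
-- Every row with the same city+street+house gets the same address_id.
CREATE TABLE keyed AS
SELECT id, city, street, house,
       DENSE_RANK() OVER (ORDER BY city, street, house) AS address_id
FROM customers;

-- One row per address_id: this is the list actually sent to the geocoder.
CREATE TABLE to_geocode AS
SELECT address_id, city, street, house
FROM keyed
GROUP BY address_id, city, street, house;
""")

print(con.execute("SELECT COUNT(*) FROM customers").fetchone()[0])   # original rows
print(con.execute("SELECT COUNT(*) FROM to_geocode").fetchone()[0])  # unique addresses
```

Three customer rows collapse to two addresses here; on real data with many duplicates the reduction, and the cost saving, is much larger.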
Geocode with your favorite service, online or offline. I suggest running a sample first.
While working from the database, some databases, such as PostgreSQL, can run API calls with Python, so you can make the calls directly from the database and write the results back to the table. The same mechanism can also geocode each new record via a trigger. If you can’t do that for security reasons, you can export the data to a CSV file and run the service on it.
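When the geocoder can’t be reached from inside the database, the CSV export route works with virtually any service. A minimal sketch, assuming the deduplicated to_geocode rows from the previous step and illustrative column names:

```python
# Export the deduplicated addresses to CSV for an external geocoding service.
import csv
import io

to_geocode = [
    (1, "Shelbyville", "Oak Avenue", 5),
    (2, "Springfield", "Main Street", 12),
]

# io.StringIO keeps the sketch self-contained; in practice use
# open("to_geocode.csv", "w", newline="") instead.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["address_id", "city", "street", "house"])  # header the service maps from
writer.writerows(to_geocode)

print(buf.getvalue().splitlines()[0])  # header row
```

Keeping the address_id as the first column is what makes the rejoin after geocoding trivial.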
After getting the geocoding results, examine the metadata, such as the accuracy and the score, and review the lower scores. Try to find out why, and whether there is a way to bulk-fix them.
If the service’s results are hard to explain, you can easily run the same addresses through another service and compare. A library such as GeoPy, used from within or outside the database, makes it fairly easy to run and compare.
Rejoin the results to your data with the address ID. I also recommend keeping some metadata from the geocoding results, such as the accuracy and the score, for future use.
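The rejoin is a plain join on the address ID, carrying the coordinates and the service’s score back to every original row. A toy SQLite sketch, with table names, coordinates, and scores all made up for illustration:

```python
# Rejoin geocoding results to the full dataset by address_id,
# keeping the service's score metadata alongside the coordinates.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE keyed (id INTEGER, address_id INTEGER);
CREATE TABLE geocoded (address_id INTEGER, lon REAL, lat REAL, score REAL);
-- Two customers share address 1; address 2 belongs to one customer.
INSERT INTO keyed VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO geocoded VALUES (1, -93.29, 44.98, 0.95), (2, -93.30, 44.97, 0.60);
""")

rows = con.execute("""
    SELECT k.id, g.lon, g.lat, g.score
    FROM keyed k
    JOIN geocoded g ON k.address_id = g.address_id
    ORDER BY k.id
""").fetchall()

print(len(rows))  # every original row now carries coordinates and a score
print(rows[0])
```

Because each address was geocoded once, the two rows sharing address_id 1 receive identical coordinates, and the stored score lets you filter or review low-confidence rows later.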
You can now decide whether to leave out the rows that failed or scored low, or to run a combined automatic/manual process on the non-geocoded rows. Usually, the leftovers I see are really garbage data or older data that can mostly be skipped, so I suggest keeping the auto/manual process as an option.