How to batch load large geographic data to BigQuery easily and without errors - part 2
A short tutorial for batch loading large geospatial data to Google's BigQuery with common, open-source tools.
Working in the cloud with Google's BigQuery data warehouse makes analyzing your data, especially spatial data, impressively fast and easy.
It offers convenient SQL and spatial SQL, especially with the new BigQuery Studio, which also includes a Python interface and ML tools.
BigQuery comes with a lot of ready-to-use spatial data, such as OpenStreetMap and the newly added Overture Maps data, but most of the time you will need to add custom data such as GPS tracks, parcels, etc.
In the previous post, I showed how to load a single, small file via the console UI.
While this is easy, most of the time you will have multiple files, larger than the console's 100 MB upload limit, from which to create geo tables.
If you have a lot of files, you will want to automate the process, making it as flawless as possible.
In this tutorial, I will show how you can batch load many large geo files to Google BigQuery and create geospatial tables, easily and with minimal errors.
There are two methods to do it:
Loading data to a Google Cloud Platform bucket first, or
loading directly to BigQuery. I will show both ways and discuss the pros and cons of each.
Loading data directly to BigQuery:
Use this method if you are sure you have a stable and fast connection, and if your files are standard, with no complex table schema that might raise errors.
Create CSV files as in the part 1 blog.
You can run GDAL's ogr2ogr in batch mode to convert a full folder of GIS data (for example, shapefiles) to CSV, using a batch file (which depends on your operating system). You can combine reprojection to WGS84 and selection of specific data in one script, for example as in the sketch below.
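A minimal Windows batch sketch for the conversion could look like this (it assumes GDAL/ogr2ogr is installed and on your PATH, that the shapefiles sit in the current folder, and that reprojecting to WGS84 fits your data; adjust names and options as needed):
@echo off
Rem Convert every shapefile in this folder to a CSV with a WKT geometry column, reprojected to WGS84
for %%f in (*.shp) do (
echo %%~nf
ogr2ogr -f CSV "%%~nf.csv" "%%f" -t_srs EPSG:4326 -lco GEOMETRY=AS_WKT
)
@echo on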
To connect from your computer and upload data securely, you must install the GCP client. It has a fast and easy installation process, and it gives you a command-line interface as well as a Python API and other SDKs: just go to the link; it is a very easy installation tutorial.
After the SDK installation, open a command line on your computer:
Enter the command: gcloud init
Then enter: gcloud auth login
You might need to enable the relevant API for your project/account at this point.
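For example, one way to enable the BigQuery API from the same CLI is the command below (assuming the relevant project is already set as the active one; you can also enable it from the console):
gcloud services enable bigquery.googleapis.com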
Now you can run Google Cloud and BigQuery commands and queries locally.
Let's test it first by running the command:
gcloud projects describe projectNameInYourBQ
If all is OK, you can now run bq with the syntax:
bq [--global_flags] <command> [--command_flags] [args]
For example, loading one CSV as in the console UI:
bq load --source_format=CSV --autodetect DatasetName.NewTableName "C:\data\some.csv"
You need to create a dataset or use an existing dataset name, and verify that the path to the CSV is correct.
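If the dataset does not exist yet, a quick way to create it from the same CLI is, for example (DatasetName is just a placeholder):
bq mk --dataset DatasetName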
But what If you have multiple files to upload?
In Windows, you can create a .bat file:
Open a text editor and add:
@echo off
Rem Run bq load for every CSV file in this folder, creating a table named after the file
for %%f in (*.csv) do (
echo %%~nf
bq load --source_format=CSV --allow_quoted_newlines --replace=true DatasetName.%%~nf %%f
)
@echo on
Save it with a .bat extension (choosing "All Files" as the file type in the editor) in the folder with the CSV files.
- The bat file will take all CSV file names in the folder and run bq load for each of them.
Double-click the file, or drag it into a cmd window and hit Enter.
You can compress your files with gzip (.gz) and BigQuery will know how to handle them, which will save you time in the upload process.
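For example, assuming you have gzip available on your machine, something like this should work (the file and dataset names are placeholders):
gzip some.csv
bq load --source_format=CSV --autodetect DatasetName.SomeTable some.csv.gz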
If all is OK, the bq CLI will start uploading the files and then create the tables for you.
- You can go to your project in BigQuery and see your data loaded and ready.
- Next, you can query it, visualize it on a map, and more, as described in the first part of this blog.
Loading data from the GCP bucket:
The downside of the direct-to-BQ method above is that if some tables fail to be created because of a data error (see the last blog: common errors),
you will discover it only after waiting for the upload to finish, since only the table-creation job finds the errors.
This can force you to repeat the upload again and again, wasting a lot of time if the data is large or if you have a slow internet connection.
The process for loading data via a GCP bucket:
The method is to load the CSVs to a bucket in the same GCP region as your project.
Then, create the tables from the data already in the bucket, so if there are schema errors you can fix the load command without uploading the files all over again (see the example below).
The caveat: if the data itself needs to be fixed, this won't help :-(
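For example, if --autodetect guesses a column type wrong, you can rerun the load from the bucket with an explicit schema instead of re-uploading anything. This is only a sketch: the column names are illustrative, and a WKT geometry column can be declared as GEOGRAPHY:
bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines mydataset.mytable gs://my-bucket/myfile.csv id:INTEGER,name:STRING,WKT:GEOGRAPHY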
Load the CSVs to your bucket, manually or with the CLI (up to 4 GB):
- If you don't have a bucket, you need to create one manually or, better, with the command line:
gcloud storage buckets create gs://BUCKET_NAME
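If you want to pin the new bucket to the same region as your BigQuery dataset, as recommended above, you can add a location flag, for example (US is only an illustration):
gcloud storage buckets create gs://BUCKET_NAME --location=US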
- Now, run the batch upload command:
gcloud storage cp *.csv gs://BUCKET_NAME
This command will upload all the CSV files from the current directory you are in.
You can add the --gzip-in-flight-all flag (or -J for short) to the end of the command, so it will gzip your files automatically on your computer and decompress them in the cloud:
gcloud storage cp *.csv gs://my-bucket -J
After the loading is done, you can create tables in BigQuery from the console or from the CLI, manually one by one, or in batch!
Batch creating tables in Google BigQuery from the GCP bucket:
- After the data is loaded to the bucket, you have to create or use a dataset, preferably in the same region as the bucket. You can do it in the console (see the previous post) or easily with the bq CLI:
bq --location=US mk -d mydsname
- Creating one table manually, in the console or the CLI, is done with the basic command:
bq load --source_format=CSV --allow_quoted_newlines --autodetect mydataset.mytable gs://my-bucket/myfile.csv
I didn't find a built-in way to take a bucket of CSVs and automate the table creation.
The way to do it is with a batch file (OS dependent) or with the Python SDK.
- Here I will describe a way to do it with just a batch file in Windows:
- We will need the list of file names in the bucket. We can take it from the previous phase (loading the data), or create it by getting the list from the bucket via the CLI, writing the output of the ls command to a file:
gcloud storage ls gs://my-bucket > myfilelist.txt
- It will create a list with the bucket path and the files in your bucket; the rows will look like this:
gs://my-bucket/myfile1.csv
gs://my-bucket/myfile2.csv
Now we can create a new .bat file in Notepad and paste:
@echo off
Rem change this to your dataset name:
set ds=mydataset
Rem change this to your filelist name
set file=myfilelist.txt
if exist %file% (
for /F "usebackq tokens=*" %%f in (%file%) do bq load --source_format=CSV --allow_quoted_newlines --autodetect %ds%.%%~nf %%f
) else (echo %file% not found)
@echo on
The batch file basically reads the file path names and uses bq load to load each one into a new table named after the CSV file.
You will need to change the ds variable to your target dataset name, and the file variable to your file list name, before running the bat file.
- If it ran without errors, you can verify and see the list of created tables by running:
bq ls mydataset
- If you got errors related to the bq load process, you can find common problems and fixes in my previous post.
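To dig into a specific failure, the bq CLI can also list your recent jobs and show the error details of a single one (the job ID below is just a placeholder):
bq ls -j -n 10
bq show -j bqjob_r123456_0001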
- Optional: clean all of your CSVs from the bucket by running:
gcloud storage rm gs://my-bucket/*.csv
(use this rm command only for fewer than ~100 files; for larger numbers of files there are better methods)
Next phases: viewing and analyzing BigQuery results (on a map).