How to batch load large geographic data to BigQuery easily and without errors - part 2
A short tutorial for batch loading large geospatial data to Google's BigQuery with common, open-source tools.
Working in the cloud with Google's BigQuery data warehouse makes analyzing your data, especially spatial data, impressively fast and easy.
It offers convenient SQL and spatial SQL, especially with the new BigQuery Studio, which also includes a Python interface and ML tools.
BigQuery comes with a lot of ready-to-use spatial data, such as OpenStreetMap and the newly added Overture Maps data, but most of the time you will need to add custom data such as GPS tracks, parcels, etc.
In the previous post, I showed how to load a single, small file via the console UI.
While this is easy, most of the time you will have multiple files, larger than the console's 100 MB upload limit, from which to create geo tables.
If you have a lot of files, you will want to automate the process, making it as flawless as possible.
In this tutorial, I will show how you can batch load many large geo files to Google BigQuery and create geospatial tables, easily and with minimal errors.
There are two methods to do it:
Loading data to a Google Cloud Platform bucket first, or
loading directly to BigQuery. I will show both ways and discuss the pros and cons of each.
Loading data directly to BigQuery:
Use this method if you are sure you have a stable and fast connection, and if your files are standard, with no complex table schema that might raise errors.
Create CSV files as in the part 1 blog.
You can run GDAL's ogr2ogr in batch mode to convert a full folder of GIS data (for example, shapefiles) to CSV, using a batch file (which depends on your operating system). You can combine reprojection to WGS84 and selection of specific data in one script, for example as in the sketch below.
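A minimal Windows batch sketch for the conversion could look like this (it assumes GDAL/ogr2ogr is installed and on your PATH, that the shapefiles sit in the current folder, and that reprojecting to WGS84 fits your data; adjust names and options as needed):
@echo off
Rem Convert every shapefile in this folder to a CSV with a WKT geometry column, reprojected to WGS84
for %%f in (*.shp) do (
echo %%~nf
ogr2ogr -f CSV "%%~nf.csv" "%%f" -t_srs EPSG:4326 -lco GEOMETRY=AS_WKT
)
@echo on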
To connect from your computer and upload data securely, you must install the GCP client. It has a fast and easy installation process, and it gives you a command-line interface as well as a Python API and other SDKs: just go to the link; it is a very easy installation tutorial.
After the SDK installation, open a command line on your computer:
Enter the command: gcloud init
Then enter: gcloud auth login
You might need to enable the relevant API for your project/account at this point.
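For example, one way to enable the BigQuery API from the same CLI is the command below (assuming the relevant project is already set as the active one; you can also enable it from the console):
gcloud services enable bigquery.googleapis.com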
Now you can run Google Cloud and BigQuery commands and queries locally.
Let's test it first by running the command:
gcloud projects describe projectNameInYourBQ
If all is OK, you can now run bq with the syntax:
bq [--global_flags] <command> [--command_flags] [args]
For example, loading one CSV as in the console UI:
bq load --source_format=CSV --autodetect DatasetName.NewTableName "C:\data\some.csv"
You need to create a dataset or use an existing dataset name, and verify that the path to the CSV is correct.
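If the dataset does not exist yet, a quick way to create it from the same CLI is, for example (DatasetName is just a placeholder):
bq mk --dataset DatasetName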
But what If you have multiple files to upload?
In Windows, you can create a .bat file:
Open a text editor and add:
@echo off
Rem Run bq load for every CSV file in this folder, creating a table named after the file
for %%f in (*.csv) do (
echo %%~nf
bq load --source_format=CSV --allow_quoted_newlines --replace=true DatasetName.%%~nf %%f
)
@echo on
Save it with a .bat extension (choosing "All Files" as the file type in the editor) in the folder with the CSV files.
- The bat file will take all CSV file names in the folder and run bq load for each of them.
Double-click the file, or drag it into a cmd window and hit Enter.
You can compress your files with gzip (.gz) and BigQuery will know how to handle them, which will save you time in the upload process.
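For example, assuming you have gzip available on your machine, something like this should work (the file and dataset names are placeholders):
gzip some.csv
bq load --source_format=CSV --autodetect DatasetName.SomeTable some.csv.gz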
If all is OK, the bq CLI will start uploading the files and then create the tables for you.
- You can go to your project in BigQuery and see your data loaded and ready.
- Next, you can query it, visualize it on a map, and more, as described in the first part of this blog.
Loading data from the GCP bucket:
The downside of the direct-to-BQ method above is that if some tables fail to be created because of a data error (see the last blog: common errors),
you will discover it only after waiting for the upload to finish, since only the table-creation job finds the errors.
This can force you to repeat the upload again and again, wasting a lot of time if the data is large or if you have a slow internet connection.
The process for loading data via a GCP bucket:
The method is to load the CSVs to a bucket in the same GCP region as your project.
Then, create the tables from the data already in the bucket, so if there are schema errors you can fix the load command without uploading the files all over again (see the example below).
The caveat: if the data itself needs to be fixed, this won't help :-(
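For example, if --autodetect guesses a column type wrong, you can rerun the load from the bucket with an explicit schema instead of re-uploading anything. This is only a sketch: the column names are illustrative, and a WKT geometry column can be declared as GEOGRAPHY:
bq load --source_format=CSV --skip_leading_rows=1 --allow_quoted_newlines mydataset.mytable gs://my-bucket/myfile.csv id:INTEGER,name:STRING,WKT:GEOGRAPHY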
Load the CSVs to your bucket, manually or with the CLI (up to 4 GB):
- If you don't have a bucket, you need to create one manually or, better, with the command line:
gcloud storage buckets create gs://BUCKET_NAME
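If you want to pin the new bucket to the same region as your BigQuery dataset, as recommended above, you can add a location flag, for example (US is only an illustration):
gcloud storage buckets create gs://BUCKET_NAME --location=US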
- Now, run the batch upload command:
gcloud storage cp *.csv gs://BUCKET_NAME
This command will upload all the CSV files from the current directory you are in.
You can add the --gzip-in-flight-all flag (or -J for short) to the end of the command, so it will gzip your files automatically on your computer and decompress them in the cloud:
gcloud storage cp *.csv gs://my-bucket -J
After the loading is done, you can create tables in BigQuery from the console or from the CLI, manually one by one, or in batch!
Batch creating tables in Google BigQuery from the GCP bucket:
- After the data is loaded to the bucket, you have to create or use a dataset, preferably in the same region as the bucket. You can do it in the console (see the previous post) or easily with the bq CLI:
bq --location=US mk -d mydsname
- Creating one table manually, in the console or the CLI, is done with the basic command:
bq load --source_format=CSV --allow_quoted_newlines --autodetect mydataset.mytable gs://my-bucket/myfile.csv
I didn't find a built-in way to take a bucket of CSVs and automate the table creation.
The way to do it is with a batch file (OS dependent) or with the Python SDK.
- Here I will describe a way to do it with just a batch file in Windows:
- We will need the list of file names in the bucket. We can take it from the previous phase (loading the data), or create it by getting the list from the bucket via the CLI, writing the output of the ls command to a file:
gcloud storage ls gs://my-bucket > myfilelist.txt
- It will create a list with the bucket path and the files in your bucket; the rows will look like this:
gs://my-bucket/myfile1.csv
gs://my-bucket/myfile2.csv
Now we can create a new .bat file in Notepad and paste:
@echo off
Rem change this to your dataset name:
set ds=mydataset
Rem change this to your filelist name
set file=myfilelist.txt
if exist %file% (
for /F "usebackq tokens=*" %%f in (%file%) do bq load --source_format=CSV --allow_quoted_newlines --autodetect %ds%.%%~nf %%f
) else (echo %file% not found)
@echo on
The batch file basically reads the file path names and uses bq load to load each one into a new table named after the CSV file.
You will need to change the ds variable to your target dataset name, and the file variable to your file list name, before running the bat file.
- If it ran without errors, you can verify and see the list of created tables by running:
bq ls mydataset
- If you got errors related to the bq load process, you can find common problems and fixes in my previous post.
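To dig into a specific failure, the bq CLI can also list your recent jobs and show the error details of a single one (the job ID below is just a placeholder):
bq ls -j -n 10
bq show -j bqjob_r123456_0001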
- Optional: clean all of your CSVs from the bucket by running:
gcloud storage rm gs://my-bucket/*.csv
(use this rm command only for fewer than ~100 files; for larger numbers of files there are better methods)
Next phases: viewing and analyzing BigQuery results (on a map).