Something

How to load Geographic data to BigQuery easily part 2 -Batch load and big data

How to batch load large geographic data to BigQuery easily without errors - part 2

A short tutorial for batch loading large Geospatial data to Google’s Big Query in common and open source tools.

Working in the cloud in Google Big Query Warehouse, analyzing your data, especially spatial data is impressively fast and easy.
With a convenient SQL and spatial SQL especially with the new big query studio which includes also a Python user interface and ML.
Google Big query comes with a lot of ready-to-use spatial data such as open street map and the newly added overture maps data for you to use, but most of the time you will need to add custom data such as GPS tracks or parcels, etc.

In the previous post, I had shown, how to load a single and small file via the console  UI.
While this can be done easily, most of the time you will have multiple files, larger than the 100 MB limit to upload from the console and creating Geo tables from.
If you have a lot of files, you will want to automate the process, making it as flawless as possible.
In this tutorial, I will show how you can batch load a lot of large Geo files to Google big query, creating geospatial tables, easily and with minimum errors.

 

man roll rock up. Image by Phuong Nguyen from Pixaba
Image by Phuong Nguyen from Pixaba

There are two methods to do it:
Loading data to a Google Cloud platform bucket first or
loading directly to Big Query. I will show both ways, and discuss pros and cons per each:

Loading data directly to Big query:
Use this way if you are sure you have a stable and fast connection and if your files are standard and no complex table scheme that, might arise errors.

  1. Create CSV files as in the part 1 blog.
    You can run GDAL’s ogr2ogr in batch mode for converting a full folder of GIS data to CSV, such as shapefiles to CSV, using a batch file (depends on your operating system. You can combine projection to wgs84 and select specific data in one script.

  2. To connect from your computer and upload data securely you must Install the GCP client.
    It’s fast and easy installation process and has a command line interface and also Python API or other sdk’s): just go to the link. its a very easy installation tutorial.

After the SDK’s installation open a command line on your computer:

    1. Enter the command: gcloud init

    2. Enter command: gcloud auth login

    3. You might need to enable the API of the project/account here.

    4. Now, you can run commands and queries in Google Cloud or big query locally:

      • let’s test it first by running the command:
        gcloud projects describe
        projectNameInYourBQ

      • If it is all ok, you can run now bq with the commands:

        Bq [–global_flags] [–command_flags] [args]

    5. for example, loading one csv as in the console UI:

      bq load –source_format=CSV –autodetect DatasetName.NewTableName “C:\data\some.csv”

      • You need to create or give an existing datasetname and to verify that the path to the CSV is correct.

    6. But what If you have multiple files to upload?

      In Windows, you can create a .bat file:

      • Open a text editor and add:

				
					@echo off
for %%f in (*.csv) do
    (
    echo %%~nf
    bq load --source_format=CSV --allow_quoted_newlines --replace=true   DatasetName.%%~nf %%f
    )
@echo on
				
			
      • Save as “all files” with A .bat extension in the folder with the CSV files

      • The bat file, will take all CSV file names in the folder and run the bq load per each.
      • Double-click the file or drag it to the cmd window and hit enter

      • You can zip your files with gzip (.gz) and Google will know how to handle them, so it will save you time in the upload process.

      • If all is ok, the bq CLI will start uploading the files and then create tables for you.

      • You can go to your project on big query and see your data loaded and ready.
      • Next you can query, visualize it on map and create as described in the first part blog

Loading data from the GCP bucket:
The downsides for the above method directly to BQ, are if some tables fail to be created, because of some data error (see last blog: common errors)
you will discover it only after waiting for the upload process to finish and only the creation job will find the errors.
It will make you repeating the upload again and again, spending a lot of time, if the data is large, or if you have a low internet connection.

I recommend using this method:
pexels-obi-onyeador

Loading data from the GCP bucket process:

  1. The method is to load the CSV to your bucket in the same GCP region of your project.

    Then, create tables from the data loaded to the bucket, so if there are scheme errors, you can fix the command without uploading the files all over.

    The caveat is in case the data needs to be fix, it won’t help -:(

  2. Load CSV manually or with the CLI (up to 4Gb) to your bucket:

      • If you don’t have a bucket, you need to create one manually or better do it with the command line:
        cloud storage buckets create gs://BUCKET_NAME
      • Now batch load command:
        gcloud storage cp *.csv
        gs://BUCKET_NAME

This command will upload all csv files in the current directory you are in.
You can add the gzip-in-flight-all or short -J flag to the end of the command, so it will gzip your files automatically on the computer and unzip them on the cloud:

gcloud storage cp *.csv gs://my-bucket -J

After loading done, you can create tables on bq manually from the console or from the CLI, one by one or batch!

Batch create tables in a Google Big Query from the GCP bucket:

  1. After data is loaded to the bucket, You have to create or use a dataset, preferably in the same region as the bucket. You can do it in the console (see the prev. post) or easily in bq CLI:
    bq –location=[US] mk -d mydsname
  2. Creating one table manually in the console or in the CLI is with the basic command:
    bq load –source_format=CSV –allow_quoted_newlines –autodetect mydatset.mytable gs://mybukect/myfile.csv
    I didn’t find a built-in way for taking a bucket of csv’s and automating the table creation.
    The way to do it is by a batch file (os depended) or with the python sdk.
  3. Here I will describe a way to do it with just a batch file in Windows:
  4. we will need the list of file names in the bucket. We can take it from the previous phase (loading the data or create it by getting the list from the bucket via the CLI, writing the output of the ls command:
    gcloud storage ls gs://my-bucket>myfilelist.txt
  5. It will create a file list of the bucket path and the files in your bucket, rows will look like this:
    gs://mybukect/myfile1.csv
    gs://mybukect/myfile2.csv
  6. Now we can create new .bat file in notepad and paste:

				
					@echo off

Rem change this to your dataset name:

set ds=mydataset

Rem change this to your filelist name

set file=myfilelist.txt

if exist %file% (

    for /F "usebackq tokens=*" %%f in (%file%) do bq load --source_format=CSV --allow_quoted_newlines --autodetect %ds%.%%~nf %%f

       ) else (echo %file% not found)

@echo on
				
			
  1. The Batch file basically reads the file path names and using the bq loads to a new table with the name of the CSV file.
    You will need to change the ds variable to your target dataset name and your file list name before running the bat file.

  2. If it ran without errors, you can verify and see the list of tables created by running:
    Bq ls mydataset
  3. If you got errors related to the bq load process, you can find common problems and fixes in my previous post
  4. Optional: Clean all of your CSV’s from the  bucket by running:
    gcloud storage rm gs://my-bucket/*.csv
    (use this rm command only for <100 files, for large file amount  there are better methods)

Next phases: viewing and analyzing Big Query results (on map)

Or optimizing the geometry data, for fast queries and performance by converting the wkt to geometry.
 

If you need help, we will be happy to assist…🙂

——————————————-
I’m Ran Tzkhori, a Geo-Spatial data and Spatial systems expert.

We help organizations find fast answers with the aid of spatial data and spatial SQL in their existing or new systems in house or in the cloud.

If you need advice, don’t hesitate to reach me.⬆️

Happy mapping!

Bucket on sand image by pexels-pixabay
pexels-pixabay

Share Post

Share on facebook
Facebook
Share on twitter
Twitter
Share on pinterest
Pinterest
Share on linkedin
LinkedIn
Share on whatsapp
WhatsApp
Share on telegram
Telegram

Leave a Reply

Your email address will not be published.

מעבר להודעה
1
Hi There,
I am Ran, Mikoom’s Founder.
I will be happy to answer any question.
Just click on the button below to reach me.
Skip to content