Moving data between buckets

From bucket to bucket

The following command uses gsutil, the command-line tool for interacting with Google Cloud Storage. It performs a high-performance copy (cp) from one Cloud Storage location (gs://source) to another (gs://destination).

gsutil -o "GSUtil:parallel_composite_upload_threshold=150M" \
       -o "GSUtil:parallel_thread_count=10" \
       -o "GSUtil:check_hashes=if_fast_else_skip" \
       -h "Content-Encoding:gzip" \
       -o "GSUtil:sliced_object_download=true" \
       -o "GSUtil:sliced_object_upload=true" \
       -m cp -r "gs://source" "gs://destination"

gsutil: This is the command-line utility for Google Cloud Storage operations.

-o "GSUtil:parallel_composite_upload_threshold=150M": This option specifies the threshold size for parallel composite uploads. When uploading large files exceeding the specified threshold (in this case, 150 megabytes), gsutil will split the files into smaller parts and upload them in parallel for improved performance.

-o "GSUtil:parallel_thread_count=10": This option sets the number of parallel threads used for data transfers. In this case, gsutil is instructed to use 10 parallel threads, which can speed up data transfers, especially when dealing with large files.

-o "GSUtil:check_hashes=if_fast_else_skip": This option controls how gsutil checks data integrity during transfers. By setting this option to if_fast_else_skip, gsutil will perform CRC32c checksum validation if the server supports it, but if not, it will skip the checksum validation to improve transfer speed.

-h "Content-Encoding:gzip": This sets an HTTP header during the data transfer. Specifically, it sets the Content-Encoding header to gzip, which indicates that the data being copied is in GZIP-compressed format. This can be useful for reducing the data size during transfer, especially for text-based data.

-o "GSUtil:sliced_object_download=true" and -o "GSUtil:sliced_object_upload=true": These options enable sliced object download and upload, respectively. Sliced object transfer allows gsutil to perform parallel downloads or uploads for a single object. This can enhance performance for transferring large individual objects.

-m: This top-level option enables parallel (multi-threaded and multi-process) operation, which improves performance when transferring many files.

cp -r "gs://source" "gs://destination"

This is the main gsutil cp command, where gs://source is the source location and gs://destination is the target location in Google Cloud Storage. The -r flag makes the operation recursive, so bucket "directories" (object prefixes) and their contents are copied as well.
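
For repeated or resumable copies between buckets, gsutil also provides rsync, which only transfers objects that are missing or changed at the destination. A sketch using the same placeholder bucket names:

# Mirror the source bucket into the destination; add -d to also delete destination objects
# that no longer exist at the source.
gsutil -m rsync -r "gs://source" "gs://destination"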

  • In summary, this gsutil command copies data from gs://source to gs://destination with options that tune parallel composite uploads, thread count, checksum validation, the Content-Encoding header, and sliced object transfers, while the -m flag enables multi-threaded operation. Together these settings make the transfer faster and more efficient, especially for large data sets and files in Google Cloud Storage.

Importing data from bucket (from GCS to BigQuery)

The next command uses bq load to load data from CSV files located in Google Cloud Storage (GCS) into a BigQuery table. Let's break down the command and its parameters:

First, get the table schema (for example, from the BigQuery console, as in the screenshot below) and save it into a local JSON file.

(Screenshot: the table schema shown in the BigQuery console)
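
If the table (or another table with the same layout) already exists in BigQuery, one way to produce that local JSON schema file is bq show. A sketch using the table and file names from the command below:

bq show --schema --format=prettyjson marketdata.cboe_auxiliary_indices > ./cboe_auxiliary_indexes.json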

bq load --source_format=CSV \
        --skip_leading_rows=1 \
        marketdata.cboe_auxiliary_indices \
        gs://cboe_auxiliary_indices-ax/cai-* \
        ./cboe_auxiliary_indexes.json

bq load: bq is the command-line tool for interacting with Google BigQuery; its load subcommand loads data into BigQuery tables.

--source_format=CSV: This parameter specifies the format of the source data, which is CSV in this case. BigQuery supports various source formats, including CSV, JSON, Avro, Parquet, etc.

--skip_leading_rows=1: This parameter indicates the number of rows at the beginning of the CSV file to skip. In this case, it is set to 1, which means the first row of the CSV file will be considered as the header and will be skipped during the loading process.

marketdata.cboe_auxiliary_indices: This is the destination table in BigQuery where the data will be loaded. The table is located in the dataset named marketdata, and its name is cboe_auxiliary_indices.

gs://cboe_auxiliary_indices-ax/cai-*: This is the path to the source data in Google Cloud Storage. The gs:// prefix indicates that it's a GCS path. The cai-* part is a wildcard that matches multiple files with names starting with cai- in the specified bucket or directory.

./cboe_auxiliary_indexes.json: This is the path to a local JSON schema file. The schema defines the names and types of the columns in the CSV files and helps BigQuery interpret and load the data correctly. The schema argument is optional: if it is omitted, BigQuery uses the schema of the existing destination table, or the --autodetect flag can be passed to have BigQuery infer the schema from the data.
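
For reference, a BigQuery JSON schema file is just an array of column definitions. A minimal sketch of what cboe_auxiliary_indexes.json could look like (the column names and types are illustrative placeholders, not the real table definition):

cat > ./cboe_auxiliary_indexes.json <<'EOF'
[
  {"name": "trade_date", "type": "DATE", "mode": "REQUIRED"},
  {"name": "index_symbol", "type": "STRING", "mode": "REQUIRED"},
  {"name": "close_value", "type": "FLOAT", "mode": "NULLABLE"}
]
EOF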

In summary, the command loads CSV data from multiple files (matching the wildcard pattern) located in the GCS bucket/directory gs://cboe_auxiliary_indices-ax/. It skips the first row (header row) of each CSV file. The data is then loaded into the BigQuery table marketdata.cboe_auxiliary_indices. If available, the JSON schema file cboe_auxiliary_indexes.json is used to specify the structure of the data.
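
To sanity-check the load afterwards, one option is to count the rows that landed in the destination table (a sketch using Standard SQL):

bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM `marketdata.cboe_auxiliary_indices`'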

Automation

You can check the automation script here: https://gist.github.com/henriqueblobato/09d71d420ea3e53e4127c9ea75fa1ffa
