@ianmcook
Last active November 5, 2024 17:13
Use the Snowflake SQL REST API from a shell script with curl and jq to execute a query and download the result partitions in Arrow format
request-body.json:

{
  "statement": "SELECT * FROM MYTABLE",
  "resultSetMetaData": {
    "format": "arrowv1"
  },
  "timeout": 60,
  "database": "MYDATABASE",
  "schema": "MYSCHEMA",
  "warehouse": "MYWAREHOUSE",
  "role": "MYROLE"
}
#!/bin/sh

# Account identifier, user, and private key used for key-pair (JWT) authentication
account_ident="ACCOUNT-IDENT"
user="USER@EXAMPLE.COM"
private_key_path="/path/to/rsa_key.p8"

# Generate a JWT bearer token with snowsql
bearer_token=$(echo "" | snowsql --private-key-path "$private_key_path" --generate-jwt -a "$account_ident" -u "$user")

# The account URL subdomain is the lowercased account identifier
subdomain=$(echo "$account_ident" | tr '[:upper:]' '[:lower:]')

# Submit the statement in request-body.json; the response (the result metadata) is saved to result-metadata.json
curl -o result-metadata.json -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Bearer $bearer_token" \
  -H "User-Agent: customApplication/1.0" \
  -H "X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT" \
  -d "@request-body.json" \
  "https://$subdomain.snowflakecomputing.com/api/v2/statements"

# Extract the statement handle and the number of partitions after the first (partition 0)
statement_handle=$(jq -r '.statementHandle' result-metadata.json)
partitions=$(jq '.resultSetMetaData.partitionInfo | length - 1' result-metadata.json)

# Download partitions 1 through $partitions in parallel as Arrow IPC stream files
curl --parallel --compressed $(seq -f "-o partition-%g.arrows" 1 $partitions) \
  -H "Accept: application/vnd.apache.arrow.stream" \
  -H "Authorization: Bearer $bearer_token" \
  -H "User-Agent: customApplication/1.0" \
  -H "X-Snowflake-Authorization-Token-Type: KEYPAIR_JWT" \
  $(seq -f "https://$subdomain.snowflakecomputing.com/api/v2/statements/$statement_handle?partition=%g" 1 $partitions)

ianmcook commented Oct 29, 2024

Python code to read the resulting files into a PyArrow Table:

import glob
import pyarrow as pa

# Yield the record batches from each Arrow IPC stream file in turn
def yield_batches(paths):
    for path in paths:
        with open(path, "rb") as file:
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = glob.glob("*.arrows")
table = pa.Table.from_batches(yield_batches(paths))
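Note that glob.glob does not guarantee any particular ordering, so if you want the batches appended in partition order, one option is to sort the file paths numerically by the number in their names (as written by the curl command above). A small sketch:

import glob
import re

# Sort partition-1.arrows, partition-2.arrows, ... numerically rather than lexically
paths = sorted(glob.glob("partition-*.arrows"),
               key=lambda p: int(re.search(r"partition-(\d+)", p).group(1)))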


ianmcook commented Oct 29, 2024

Alternatively, if you plan to persist the downloaded data files and want to minimize the associated storage space or cost, do not decompress the files on download. To do this:

  • Remove the --compressed flag from the curl command.
  • Change the curl output flag to -o partition-%g.arrows.gz.

Then you can decompress the files when you read them into Python:

import glob
import gzip
import pyarrow as pa

# Yield the record batches from each file, decompressing gzip-compressed files
# and falling back to a plain read for files that are not actually gzipped
def yield_batches(paths):
    for path in paths:
        try:
            with gzip.open(path) as file:
                reader = pa.ipc.open_stream(file)
                for batch in reader:
                    yield batch
        except gzip.BadGzipFile:
            with open(path, "rb") as file:
                reader = pa.ipc.open_stream(file)
                for batch in reader:
                    yield batch

paths = glob.glob("*.arrows.gz")
table = pa.Table.from_batches(yield_batches(paths))

@ianmcook

Also see the example code at apache/arrow#44561 (comment), which shows how to create a record batch reader (enabling lazy/deferred reading) instead of eagerly reading the files into a table.
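For reference, a minimal sketch along those lines (not the exact code from that comment), reusing the yield_batches generator from above and assuming the uncompressed *.arrows files: it takes the schema from the first file and wraps the generator in a pyarrow.RecordBatchReader so batches are only read as they are consumed.

import glob
import pyarrow as pa

def yield_batches(paths):
    for path in paths:
        with open(path, "rb") as file:
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = glob.glob("*.arrows")

# Read only the schema from the first file
with open(paths[0], "rb") as file:
    schema = pa.ipc.open_stream(file).schema

# Wrap the generator in a RecordBatchReader; batches are read lazily as the
# consumer pulls them, e.g. via reader.read_next_batch()
reader = pa.RecordBatchReader.from_batches(schema, yield_batches(paths))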
