It streams all tables (excluding archived ones) into Google BigQuery.
This will start up a server which streams data from data sources like Atlas, MongoDB, S3 and Azure Blob Storage into Google BigQuery on a schedule, e.g. every day at midnight or every hour. You don't need anything other than this server. When the ETL process is complete the server will shut itself down, so if you want to add more tables you'll need to start it back up. Note that if the server is already on when it's scheduled to begin the ETL process, it will restart.
- Create a fast, high-memory Google Cloud VM instance (https://console.cloud.google.com/compute/instancesAdd) called `slamdata-bigquery` with access scopes set to "Allow full access to all Cloud APIs". In the "Management, security, disks, networking, sole tenancy" section, under "Networking", add the network tag `my-ip-web`, then edit the network interface and create an external IP address called `external`. Make a note of this IP address as it's the address you will use to browse your data and create virtual tables (a hedged `gcloud` sketch of these steps follows the next item)
- SSH into this instance and make a note of your user name
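If you prefer the command line, roughly the same instance can be created with `gcloud`. This is only a sketch; the machine type, zone and region below are assumptions, so pick values that suit you:

```sh
# Reserve a static external IP address named "external" (region is an assumption)
gcloud compute addresses create external --region=us-central1

# Create the instance with full Cloud API access and the my-ip-web network tag
# (zone and machine type are assumptions; choose something fast with plenty of memory)
gcloud compute instances create slamdata-bigquery \
  --zone=us-central1-a \
  --machine-type=n1-highmem-4 \
  --scopes=cloud-platform \
  --tags=my-ip-web \
  --address=external
```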
- Install the Google Stackdriver logging agent (https://cloud.google.com/logging/docs/agent/installation). This will help with troubleshooting
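At the time of writing the linked guide boils down to downloading and running Google's install script on the instance; treat this as a sketch and defer to the current docs if they have changed:

```sh
# Download and run Google's logging agent install script on the VM
curl -sSO https://dl.google.com/cloudagents/install-logging-agent.sh
sudo bash install-logging-agent.sh
```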
sudo curl -sSL https://get.docker.com/ | sh
- Go to https://cloud.docker.com/u/slamdata/repository/docker/slamdata/slamdata and find out the latest version number; use this version number in place of `1.5.14` in the following steps
sudo docker login
sudo docker pull slamdata/slamdata:1.5.14
sudo docker volume create vdw-volume
sudo docker run --restart=always -d -p 80:80 -v vdw-volume:/var/lib/slamdata/data --name vdw slamdata/slamdata:1.5.14
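A quick sanity check that the container started and the web UI answers on port 80 (nothing here is project-specific):

```sh
# Confirm the vdw container is running and responding locally
sudo docker ps --filter name=vdw
curl -I http://localhost
```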
- Perform the steps in the installation section below (ignore the standalone usage instructions)
- Go to https://www.google.com/search?q=whats+my+ip and make a note of your IP address
- Create a firewall rule (https://console.cloud.google.com/networking/firewalls/add) called `slamdata-bigquery` with the target tag `my-ip-web`, a source IP range of your IP address and the TCP ports `80, 443`
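Roughly the same rule via `gcloud`; the source range below is a placeholder, substitute the IP address you noted above:

```sh
# Allow HTTP/HTTPS from your own IP only, targeting instances tagged my-ip-web
gcloud compute firewall-rules create slamdata-bigquery \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:80,tcp:443 \
  --source-ranges=203.0.113.10/32 \
  --target-tags=my-ip-web
```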
- Create a Google Cloud Function (https://console.cloud.google.com/functions/) called `stop-then-start-instance-with-metadata` with the runtime `Node.js 8`, the source of the `index.js` and `package.json` below, a Pub/Sub trigger with a new topic called `slamdata-bigquery-start`, a timeout of 540 seconds, the function to execute `startInstancePubSub` and 128 MB of RAM
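Deploying the same function from a directory containing the `index.js` and `package.json` below might look roughly like this with `gcloud` (a sketch, not the only way to do it):

```sh
# Deploy the Pub/Sub-triggered function; run from the directory holding index.js and package.json
gcloud functions deploy stop-then-start-instance-with-metadata \
  --runtime=nodejs8 \
  --trigger-topic=slamdata-bigquery-start \
  --entry-point=startInstancePubSub \
  --timeout=540s \
  --memory=128MB \
  --source=.
```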
- If you don't already have a BigQuery dataset, go to https://console.cloud.google.com/bigquery and create one by selecting your project in the resources and then clicking `Create dataset`
- Create a Google Cloud Scheduler job (https://console.cloud.google.com/cloudscheduler) with the name `slamdata-bigquery-start`, the desired frequency (`0 0 * * *` is daily at midnight) and timezone, target `Pub/Sub`, topic `slamdata-bigquery-start` and a payload of the `payload.json` below with the details changed to match your project, VM, user name and BigQuery dataset
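An equivalent job via `gcloud` might look like the following; the time zone is an assumption and `payload.json` is the edited file described above:

```sh
# Publish the edited payload.json to the slamdata-bigquery-start topic every day at midnight
gcloud scheduler jobs create pubsub slamdata-bigquery-start \
  --schedule="0 0 * * *" \
  --time-zone="Etc/UTC" \
  --topic=slamdata-bigquery-start \
  --message-body-from-file=payload.json
```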
- Now any tables you create which are not archived will be streamed into Google BigQuery on the schedule specified
- We recommend you shut down the instance when you aren't creating tables; the scheduler and function will start it up for your scheduled ETL processes
- In your browser go to the IP address used to browse your data and create virtual tables
- Create a new table
- Add an S3 data source with the bucket URI https://s3.amazonaws.com/post-giraffe and the name Post giraffe
- Pick the following columns
- /Source Post Giraffe/Dir app-data/Data user-events.json/All keys/dateTime/Values Column String/As ISO Datetime/Column Datetime
- /Source Post Giraffe/Dir app-data/Data user-events.json/All keys/dateTime/Values Column Number/As seconds since 1970/Column Datetime
- /Source Post Giraffe/Dir app-data/Data user-events.json/All keys/S/Key Column Boolean/As number/Column Number
- /Source Post Giraffe/Dir app-data/Data user-events.json/All keys/MAS/Key Column Boolean/As number/Column Number
- Merge the columns `Values of key dateTime` and `Values of key dateTime 2`
- Name the virtual table `Emails`
- Go to https://console.cloud.google.com/cloudscheduler and run the job `slamdata-bigquery-start`
- Go to https://console.cloud.google.com/functions/list, click `stop-then-start-instance-with-metadata`, click `View logs`, then click the play button which indicates `Start streaming logs` and watch the logs until they stop and `Function execution took` appears
- Go to https://console.cloud.google.com/compute/instances, click `slamdata-bigquery`, click `Stackdriver Logging`, then click the play button which indicates `Start streaming logs` and watch until the logs stop and `Stopped Google Compute Engine Network Daemon.` appears
- Go to https://console.cloud.google.com/bigquery, select your project from the resource section, select your dataset, select the table, then click `Preview`
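If you prefer to check from the command line, a `bq` query along these lines also works, assuming the streamed table takes the virtual table's name; `example_dataset` is a placeholder for your dataset ID:

```sh
# Peek at a few streamed rows (replace example_dataset with your dataset ID)
bq query --use_legacy_sql=false 'SELECT * FROM `example_dataset.Emails` LIMIT 10'
```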
sudo apt-get install jq
curl https://gist.githubusercontent.com/beckyconning/fd4574cb6766d4c9680b10b0a81b396e/raw/sd-bq.sh > sd-bq.sh
curl https://gist.githubusercontent.com/beckyconning/fd4574cb6766d4c9680b10b0a81b396e/raw/sd-bq-single.sh > sd-bq-single.sh
chmod +x sd-bq.sh
chmod +x sd-bq-single.sh
If you just want to use this tool without any other setup you may use the following. You don't need this section if you are doing the full setup in Google Cloud with scheduling.
- Perform the steps in the installation section below
- Go to https://console.cloud.google.com/iam-admin/serviceaccounts/create for your Google Cloud project and create a service account
- Grant it the "BigQuery Admin" and "Storage Object Viewer" roles
- Create a "JSON key type" key and download it
- Replace `$PATH_TO_JSON_KEY` in `export GOOGLE_APPLICATION_CREDENTIALS=$PATH_TO_JSON_KEY` with the path to the JSON key you downloaded, then run this command
- Replace the `$PROJECT_ID` etc. in `./sd-bq.sh $PROJECT_ID $BIG_QUERY_DATASET_ID $VIRTUAL_TABLE_SERVER_URL` and run this command (e.g. `./sd-bq.sh example_project example_dataset http://slamdata.example.com`)
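Putting the last two steps together, a full standalone run might look like this, using the example values from above and assuming the key was saved as `key.json`:

```sh
# Point the Google client libraries at the downloaded key, then run the loader
export GOOGLE_APPLICATION_CREDENTIALS=./key.json
./sd-bq.sh example_project example_dataset http://slamdata.example.com
```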