Jorge belenaj

SELECT *
FROM customer
WHERE country = '${VAR_COUNTRY}'
IMPORT INTO mytable
FROM LOCAL SECURE CSV
FILE '/file.csv'
(1, 2 FORMAT = 'YYYY-MM-DD', 3..12)
ENCODING = 'ASCII'
ROW SEPARATOR = 'LF'
SKIP = 1;
COMMIT;
@belenaj
belenaj / docker-run-root.sh
Created February 28, 2019 16:31
Run docker with shell as root
docker run -u 0 -it myImage:tag bash
// ...
def endDate = new Date().clearTime() // today
def startDate = endDate - 30
def newDateParsed
startDate.upto(endDate) {
newDateParsed = it.format("yyyy-MM-dd")
println(newDateParsed)
// ...
def DAYS_BACK = 30
def iterDate = new Date() - DAYS_BACK
def newDateParse
for (i=0; i <DAYS_BACK; i++) {
iterDate = iterDate + 1
newDateParse = iterDate.format("yyyy-MM-dd")
stage("newDateParsed ${newDateParse}") {
belenaj / Dockerfile
Last active September 9, 2020 08:59
[aws cli in Docker Alpine] #docker #awscli
FROM alpine:3.10.3
ENV AWSCLI_VERSION="1.14.10"
RUN apk add --no-cache \
    openssh \
    python \
    py-pip
# installing aws cli
belenaj / Dockerfile
Last active February 4, 2020 18:01
exasol-db Dockerfile with the exasol/cloud-storage-etl-udfs jar: https://github.com/exasol/cloud-storage-etl-udfs
FROM exasol/docker-db:latest
ENV EXA_BUCKET_PATH="/exa/data/bucketfs/bfsdefault/default"
ENV CLOUD_STORAGE_VERSION="0.6.0"
ENV JAR_FILENAME="cloud-storage-etl-udfs-$CLOUD_STORAGE_VERSION.jar"
ADD https://github.com/exasol/cloud-storage-etl-udfs/releases/download/v$CLOUD_STORAGE_VERSION/$JAR_FILENAME $EXA_BUCKET_PATH/$JAR_FILENAME
RUN chmod 775 $EXA_BUCKET_PATH/$JAR_FILENAME
#RUN chown exadefusr:exausers $EXA_BUCKET_PATH/$JAR_FILENAME

filter_book.py

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import sys

sc = SparkContext('local')
spark = SparkSession(sc)

Exercises

  1. Select all "Harry Potter" books
  2. Find the book with the most pages
  3. Top 5 authors by number of books written (assume the author is the first element of the array, in its "key" field, and that each row is a different book)
  4. Top 5 genres by number of books
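
The exercises above are meant to be solved with Spark. As a way to sanity-check the results, the same logic can be sketched in plain Python on a tiny sample. This is only a sketch: the records and the field names (`title`, `number_of_pages`, `authors`, `genres`) are assumptions made for illustration and are not taken from the real dataset.

```python
from collections import Counter

# Hypothetical sample records; field names and values are assumptions.
books = [
    {"title": "Harry Potter and the Goblet of Fire", "number_of_pages": 636,
     "authors": [{"key": "/authors/A1"}], "genres": ["Fantasy"]},
    {"title": "Harry Potter and the Chamber of Secrets", "number_of_pages": 251,
     "authors": [{"key": "/authors/A1"}], "genres": ["Fantasy"]},
    {"title": "Dune", "number_of_pages": 412,
     "authors": [{"key": "/authors/A2"}], "genres": ["Science fiction"]},
    {"title": "Untitled pamphlet", "authors": [], "genres": []},
]


def harry_potter_books(rows):
    """Exercise 1: titles containing 'Harry Potter' (case-insensitive)."""
    return [r["title"] for r in rows
            if "harry potter" in r.get("title", "").lower()]


def longest_book(rows):
    """Exercise 2: the row with the highest page count, skipping rows without one."""
    with_pages = [r for r in rows if isinstance(r.get("number_of_pages"), int)]
    return max(with_pages, key=lambda r: r["number_of_pages"])


def top_authors(rows, n=5):
    """Exercise 3: books per first-author key, treating each row as one book."""
    keys = [r["authors"][0]["key"] for r in rows if r.get("authors")]
    return Counter(keys).most_common(n)


def top_genres(rows, n=5):
    """Exercise 4: books per genre; a book with several genres counts once for each."""
    return Counter(g for r in rows for g in r.get("genres", [])).most_common(n)
```

Rows with an empty author array or a missing page count are skipped rather than counted, which is one possible reading of the exercises; a Spark solution would need to make the same choice explicitly.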

  1. Average number of pages (the data needs cleaning first)
  2. For each publish year, count the authors who published at least one book
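
For these two exercises, a plain-Python sketch on a made-up sample may help pin down the expected results before writing the Spark version. Everything here is an assumption: the field names (`number_of_pages`, `publish_year`, `authors`), the sample values, and the cleaning rule (drop page counts that are missing, non-numeric, or not positive). The author count below includes every author listed on a book, not only the first one.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are assumptions.
books = [
    {"number_of_pages": 300, "publish_year": 1999, "authors": [{"key": "/authors/A1"}]},
    {"number_of_pages": "200", "publish_year": 1999, "authors": [{"key": "/authors/A1"}]},
    {"number_of_pages": None, "publish_year": 1999, "authors": [{"key": "/authors/A2"}]},
    {"number_of_pages": 0, "publish_year": 2005, "authors": [{"key": "/authors/A3"}]},
    {"number_of_pages": 100, "publish_year": 2005, "authors": []},
]


def clean_pages(value):
    """Return a positive integer page count, or None if the value is unusable."""
    try:
        pages = int(value)
    except (TypeError, ValueError):
        return None
    return pages if pages > 0 else None


def average_pages(rows):
    """Average page count over the rows that survive cleaning."""
    cleaned = (clean_pages(r.get("number_of_pages")) for r in rows)
    pages = [p for p in cleaned if p is not None]
    return sum(pages) / len(pages) if pages else None


def authors_per_year(rows):
    """Number of distinct author keys per publish year."""
    seen = defaultdict(set)
    for r in rows:
        year = r.get("publish_year")
        if year is None:
            continue
        for author in r.get("authors") or []:
            seen[year].add(author["key"])
    return {year: len(keys) for year, keys in sorted(seen.items())}
```

Using a set per year keeps an author who published several books in the same year from being counted more than once.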