@1ambda
1ambda / yarn-find-apps.sh
Last active March 2, 2019 08:16
#!/usr/bin/env bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

function usage() {
  echo "Usage:
  ${0##*/} [-h][-n=APP][-r=SECONDS][-b=DATETIME][-s=STATES][-o=ORDER]

Options:"
}
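For illustration only, the same option surface could be sketched in Python with `argparse`; the flag names mirror the shell usage string above, but the long names and help texts are assumptions, not part of the original script.

```python
# Hypothetical Python counterpart of the yarn-find-apps.sh options,
# sketched with argparse; long option names and help texts are invented.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="yarn-find-apps")
    parser.add_argument("-n", "--name", metavar="APP",
                        help="filter applications by name")
    parser.add_argument("-r", "--refresh", metavar="SECONDS", type=int,
                        help="refresh interval in seconds")
    parser.add_argument("-b", "--begin", metavar="DATETIME",
                        help="only show applications started after this time")
    parser.add_argument("-s", "--states", metavar="STATES",
                        help="comma-separated YARN application states")
    parser.add_argument("-o", "--order", metavar="ORDER",
                        help="sort order for the output")
    return parser

args = build_parser().parse_args(["-n", "my-app", "-r", "30"])
```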
@1ambda
1ambda / Dockerfile
Created October 26, 2019 00:28
Dockerfile for extending jupyter/docker-stacks' minimal notebook
FROM jupyter/minimal-notebook:1386e2046833
# -----------------------------------------------------------------------------
# --- Constants
# -----------------------------------------------------------------------------
USER $NB_USER
WORKDIR /home/$NB_USER
# Configures the Kubernetes Pod that the Airflow Worker will run on.
executorConfig = ExecutorBuilder(
image = "dask-py38",
resource = { memory: "80960Mi", cpu: "32" },
resourceCapacityType = "SPOT",
resourceNodeSelector = { "compute-type": "airflow-cpu-intensive", ... },
notebookCustomPackages = ["pandas==1.2.3", "pyarrow==3.0.0"],
notebookKernel = "python38",
...
)
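The `ExecutorBuilder` call above is pseudocode rather than a real API. Its shape can be sketched as a plain Python dict for clarity; the helper function and key names below are hypothetical and only mirror the pseudocode's fields.

```python
# Plain-dict sketch of the ExecutorBuilder pseudocode above.
# build_executor_config and its key names are illustrative, not a real API.
def build_executor_config(image, memory, cpu, capacity_type,
                          node_selector, packages, kernel):
    """Collect the per-task pod settings into a single dict."""
    return {
        "image": image,
        "resources": {"memory": memory, "cpu": cpu},
        "capacityType": capacity_type,
        "nodeSelector": node_selector,
        "notebookCustomPackages": packages,
        "notebookKernel": kernel,
    }

executor_config = build_executor_config(
    image="dask-py38",
    memory="80960Mi",
    cpu="32",
    capacity_type="SPOT",
    node_selector={"compute-type": "airflow-cpu-intensive"},
    packages=["pandas==1.2.3", "pyarrow==3.0.0"],
    kernel="python38",
)
```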
from pyspark.sql.functions import *
from pyspark.sql.types import *
# Download the CSV file into the current directory, then run the code below.
# The file has a `.csv` extension, but the actual delimiter is `\t` (tab).
# If you are following along on Databricks, change the path to "/FileStore/tables/marketing_campaign.csv".
df = spark.read.load("./marketing_campaign.csv",
                     format="csv",
                     sep="\t",
                     inferSchema=True,
                     header=True)
df.printSchema() # Prints the schema, i.e. the shape of the data.
root
|-- ID: integer (nullable = true)
|-- Year_Birth: integer (nullable = true)
|-- Education: string (nullable = true)
|-- Marital_Status: string (nullable = true)
|-- Income: integer (nullable = true)
|-- Kidhome: integer (nullable = true)
|-- Teenhome: integer (nullable = true)
df.count()    # Counts the loaded rows and prints the number.
df.show()     # Prints a portion of the data to the console.
df.toPandas() # A PySpark function that makes the data easy to inspect in Jupyter.
# Result of df.count()
2240
# Result of df.toPandas() (some rows and columns omitted)
ID Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency MntWines ... NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
0 5524 1957 Graduation Single 58138.0 0 0 04-09-2012 58 635 ... 7 0 0 0 0 0 0 3 11 1
1 2174 1954 Graduation Single 46344.0 1 1 08-03-2014 38 11 ... 5 0 0 0 0 0 0 3 11 0
2 4141 1965 Graduation Together 71613.0 0 0 21-08-2013 26 426 ... 4 0 0 0 0 0 0 3 11 0
3 6182 1984 Graduation Together 26646.0 1 0 10-02-2014 26 11 ... 6 0 0 0 0 0 0 3 11 0
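Since the file is tab-separated despite its `.csv` extension, it can also be sanity-checked locally without Spark using only Python's standard library. This is a side illustration, not part of the original gist; the inline `sample` data below stands in for the real file.

```python
# Local sanity check of tab-separated data using only the standard library.
# The sample data is inline; with the real file you would open its path instead.
import csv
import io

sample = (
    "ID\tYear_Birth\tEducation\n"
    "5524\t1957\tGraduation\n"
    "2174\t1954\tGraduation\n"
)

reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = list(reader)
row_count = len(rows)        # number of data rows, like df.count()
columns = reader.fieldnames  # header names, like df.columns
```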
# Selects columns and renames them.
# Equivalent to SQL's SELECT ID AS id, Year_Birth AS year_birth, ...
dfSelected = df.select(
col("ID").alias("id"),
col("Year_Birth").alias("year_birth"),
col("Education").alias("education"),
col("Kidhome").alias("count_kid"),
col("Teenhome").alias("count_teen"),
    col("Dt_Customer").alias("date_customer"),
)
# Result of dfSelected.count()
2240
# dfSelected.printSchema()
root
|-- id: integer (nullable = true)
|-- year_birth: integer (nullable = true)
|-- education: string (nullable = true)
|-- count_kid: integer (nullable = true)
|-- count_teen: integer (nullable = true)
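The `select` + `alias` step is essentially a column mapping: keep the listed columns, drop the rest, and assign new names. The same logic can be illustrated on plain dict rows without Spark; the code below is an analogy, not Spark itself.

```python
# Pure-Python analogy of the select + alias renaming above;
# unmapped columns (e.g. Income) are dropped, like unselected Spark columns.
RENAMES = {
    "ID": "id",
    "Year_Birth": "year_birth",
    "Education": "education",
    "Kidhome": "count_kid",
    "Teenhome": "count_teen",
    "Dt_Customer": "date_customer",
}

def select_and_rename(row: dict) -> dict:
    """Keep only the mapped columns and rename them."""
    return {new: row[old] for old, new in RENAMES.items()}

renamed = select_and_rename({
    "ID": 5524, "Year_Birth": 1957, "Education": "Graduation",
    "Kidhome": 0, "Teenhome": 0, "Dt_Customer": "04-09-2012",
    "Income": 58138.0,  # not in RENAMES, so it is dropped
})
```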
# Output of df.rdd.id (without parentheses, Python shows the bound method rather than the id)
<bound method RDD.id of MapPartitionsRDD[25] at javaToPython at NativeMethodAccessorImpl.java:0>
# Output of dfSelected.rdd.id (a different underlying RDD: select produced a new DataFrame)
<bound method RDD.id of MapPartitionsRDD[31] at javaToPython at NativeMethodAccessorImpl.java:0>
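The `<bound method RDD.id of ...>` output appears because the method was referenced without parentheses; calling it with `()` would return the actual integer id. A small pure-Python illustration of the same behavior, using a made-up `RDDLike` class:

```python
# Why "<bound method ...>" appears: referencing a method without () yields
# the bound method object; calling it returns the value. RDDLike is invented.
class RDDLike:
    def __init__(self, rdd_id: int):
        self._id = rdd_id

    def id(self) -> int:
        return self._id

rdd = RDDLike(25)
without_call = repr(rdd.id)  # the bound method object, not the id
with_call = rdd.id()         # the actual id value
```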