Created
February 7, 2019 17:20
-
-
Save FavioVazquez/07b688e1285cc852f6fb45ecb67034d8 to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Introduction to Optimus" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Install" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"From command line:\n", | |
"\n", | |
"```\n", | |
"pip install optimuspyspark\n", | |
"```\n", | |
"\n", | |
"from a notebook you can use:\n", | |
"\n", | |
"```\n", | |
"!pip install optimuspyspark\n", | |
"```" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A good thing to do here is to restart the kernel. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Checking the installation\n", | |
"from optimus import Optimus\n", | |
"op= Optimus(master=\"local\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## DataFrame creation\n", | |
"\n", | |
"Create a dataframe to passing a list of values for columns and rows. Unlike pandas you need to specify the column names." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df = op.create.df(\n", | |
" [\n", | |
" \"names\",\n", | |
" \"height(ft)\",\n", | |
" \"function\",\n", | |
" \"rank\",\n", | |
" \"weight(t)\",\n", | |
" \"japanese name\",\n", | |
" \"last position\",\n", | |
" \"attributes\"\n", | |
" ],\n", | |
" [\n", | |
" \n", | |
" (\"Optim'us\", 28.0, \"Leader\", 10, 4.3, [\"Inochi\", \"Convoy\"], \"19.442735,-99.201111\",[8.5344, 4300.0]),\n", | |
" (\"bumbl#ebéé \", 17.5, \"Espionage\", 7, 2.0, [\"Bumble\",\"Goldback\"], \"10.642707,-71.612534\",[5.334, 2000.0]),\n", | |
" (\"ironhide&\", 26.0, \"Security\", 7, 4.0, [\"Roadbuster\"], \"37.789563,-122.400356\",[7.9248, 4000.0]),\n", | |
" (\"Jazz\",13.0, \"First Lieutenant\", 8, 1.8, [\"Meister\"], \"33.670666,-117.841553\",[3.9624, 1800.0]),\n", | |
" (\"Megatron\",None, \"None\", None, 5.7, [\"Megatron\"], None,[None,5700.0]),\n", | |
" (\"Metroplex_)^$\",300.0 , \"Battle Station\", 8, None, [\"Metroflex\"],None,[91.44, None]),\n", | |
" \n", | |
" ])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n", | |
"| names|height(ft)| function|rank|weight(t)| japanese name| last position| attributes|\n", | |
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n", | |
"| Optim'us| 28.0| Leader| 10| 4.3| [Inochi, Convoy]|19.442735,-99.201111|[8.5344, 4300.0]|\n", | |
"| bumbl#ebéé | 17.5| Espionage| 7| 2.0|[Bumble, Goldback]|10.642707,-71.612534| [5.334, 2000.0]|\n", | |
"| ironhide&| 26.0| Security| 7| 4.0| [Roadbuster]|37.789563,-122.40...|[7.9248, 4000.0]|\n", | |
"| Jazz| 13.0|First Lieutenant| 8| 1.8| [Meister]|33.670666,-117.84...|[3.9624, 1800.0]|\n", | |
"| Megatron| null| None|null| 5.7| [Megatron]| null| [, 5700.0]|\n", | |
"|Metroplex_)^$| 300.0| Battle Station| 8| null| [Metroflex]| null| [91.44,]|\n", | |
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# View the data with plain spark\n", | |
"df.show() # kinda ugly" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"\n", | |
"Creating a dataframe by passing a list of tuples specifyng the column data type. You can specify as data type an string or a Spark Datatypes. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html\n", | |
"\n", | |
"Also you can use some Optimus predefined types:\n", | |
"\n", | |
"- \"str\" = StringType()\n", | |
"- \"int\" = IntegerType()\n", | |
"- \"float\" = FloatType()\n", | |
"- \"bool\" = BoleanType()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n", | |
"| names|height(ft)| function|rank|weight(t)| japanese name| last position| attributes|\n", | |
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n", | |
"| Optim'us| 28.0| Leader| 10| 4.3| [Inochi, Convoy]|19.442735,-99.201111|[8.5344, 4300.0]|\n", | |
"| bumbl#ebéé | 17.5| Espionage| 7| 2.0|[Bumble, Goldback]|10.642707,-71.612534| [5.334, 2000.0]|\n", | |
"| ironhide&| 26.0| Security| 7| 4.0| [Roadbuster]|37.789563,-122.40...|[7.9248, 4000.0]|\n", | |
"| Jazz| 13.0|First Lieutenant| 8| 1.8| [Meister]|33.670666,-117.84...|[3.9624, 1800.0]|\n", | |
"| Megatron| null| None|null| 5.7| [Megatron]| null| [, 5700.0]|\n", | |
"|Metroplex_)^$| 300.0| Battle Station| 8| null| [Metroflex]| null| [91.44,]|\n", | |
"+-------------+----------+----------------+----+---------+------------------+--------------------+----------------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"f = op.create.df(\n", | |
" [\n", | |
" (\"names\", \"str\"),\n", | |
" (\"height\", \"float\"),\n", | |
" (\"function\", \"str\"),\n", | |
" (\"rank\", \"int\"),\n", | |
" ],\n", | |
" [\n", | |
" (\"bumbl#ebéé \", 17.5, \"Espionage\", 7),\n", | |
" (\"Optim'us\", 28.0, \"Leader\", 10),\n", | |
" (\"ironhide&\", 26.0, \"Security\", 7),\n", | |
" (\"Jazz\",13.0, \"First Lieutenant\", 8),\n", | |
" (\"Megatron\",None, \"None\", None),\n", | |
" \n", | |
" ])\n", | |
"df.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Creating a Daframe using a pandas dataframe" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<style>\n", | |
" .data_type {\n", | |
" font-size: 0.8em;\n", | |
" font-weight: normal;\n", | |
" }\n", | |
"\n", | |
" .column_name {\n", | |
" font-size: 1.2em;\n", | |
" }\n", | |
"\n", | |
" .info_items {\n", | |
" margin: 10px 0;\n", | |
" font-size: 0.8em;\n", | |
" }\n", | |
"\n", | |
" .optimus_table tr:nth-child(even) {\n", | |
" background-color: #f2f2f2 !important;\n", | |
" }\n", | |
"\n", | |
" .optimus_table tr:nth-child(odd) {\n", | |
" background-color: #ffffff !important;\n", | |
" }\n", | |
"\n", | |
" .optimus_table thead {\n", | |
" border-bottom: 1px solid black;\n", | |
" }\n", | |
" .optimus_table{\n", | |
" font-size: 12px;\n", | |
" }\n", | |
"\n", | |
"</style>\n", | |
"\n", | |
"\n", | |
"\n", | |
"\n", | |
"<div class=\"info_items\">Viewing 3 of 3 rows / 4 columns</div>\n", | |
"<div class=\"info_items\">1 partition(s)</div>\n", | |
"\n", | |
"<table class=\"optimus_table\">\n", | |
" <thead>\n", | |
" <tr>\n", | |
" \n", | |
" <th>\n", | |
" <div class=\"column_name\">names</div>\n", | |
" <div class=\"data_type\">1 (string)</div>\n", | |
" <div class=\"data_type\">\n", | |
" \n", | |
" nullable\n", | |
" \n", | |
" </div>\n", | |
" </th>\n", | |
" \n", | |
" <th>\n", | |
" <div class=\"column_name\">height</div>\n", | |
" <div class=\"data_type\">2 (double)</div>\n", | |
" <div class=\"data_type\">\n", | |
" \n", | |
" nullable\n", | |
" \n", | |
" </div>\n", | |
" </th>\n", | |
" \n", | |
" <th>\n", | |
" <div class=\"column_name\">function</div>\n", | |
" <div class=\"data_type\">3 (string)</div>\n", | |
" <div class=\"data_type\">\n", | |
" \n", | |
" nullable\n", | |
" \n", | |
" </div>\n", | |
" </th>\n", | |
" \n", | |
" <th>\n", | |
" <div class=\"column_name\">rank</div>\n", | |
" <div class=\"data_type\">4 (bigint)</div>\n", | |
" <div class=\"data_type\">\n", | |
" \n", | |
" nullable\n", | |
" \n", | |
" </div>\n", | |
" </th>\n", | |
" \n", | |
" </tr>\n", | |
"\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" \n", | |
" <tr>\n", | |
" \n", | |
" <td>\n", | |
" bumbl#ebéé⸱⸱\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" Espionage\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" 17.5\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" 7\n", | |
" </td>\n", | |
" \n", | |
" </tr>\n", | |
" \n", | |
" <tr>\n", | |
" \n", | |
" <td>\n", | |
" Optim'us\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" Leader\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" 28.0\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" 10\n", | |
" </td>\n", | |
" \n", | |
" </tr>\n", | |
" \n", | |
" <tr>\n", | |
" \n", | |
" <td>\n", | |
" ironhide&\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" Security\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" 26.0\n", | |
" </td>\n", | |
" \n", | |
" <td>\n", | |
" 7\n", | |
" </td>\n", | |
" \n", | |
" </tr>\n", | |
" \n", | |
" </tbody>\n", | |
"</table>\n", | |
"\n", | |
"<div class=\"info_items\">Viewing 3 of 3 rows / 4 columns</div>\n", | |
"<div class=\"info_items\">1 partition(s)</div>\n" | |
], | |
"text/plain": [ | |
"<IPython.core.display.HTML object>" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"\n", | |
"data = [(\"bumbl#ebéé \", 17.5, \"Espionage\", 7),\n", | |
" (\"Optim'us\", 28.0, \"Leader\", 10),\n", | |
" (\"ironhide&\", 26.0, \"Security\", 7)]\n", | |
"labels = [\"names\", \"height\", \"function\", \"rank\"]\n", | |
"\n", | |
"# Create pandas dataframe\n", | |
"pdf = pd.DataFrame.from_records(data, columns=labels)\n", | |
"\n", | |
"df = op.create.df(pdf = pdf)\n", | |
"df.table()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Columns and Rows\n", | |
"\n", | |
"Optimus organized operations in columns and rows. This is a little different of how pandas works in which all operations are around the pandas class. We think this approach can better help you to access and transform data." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+---------+------+------------+----+\n", | |
"| function|height| names|rank|\n", | |
"+---------+------+------------+----+\n", | |
"|Espionage| 17.5|bumbl#ebéé | 7|\n", | |
"| Leader| 28.0| Optim'us| 10|\n", | |
"| Security| 26.0| ironhide&| 7|\n", | |
"+---------+------+------------+----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.sort().show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+------+---------+----+\n", | |
"| names|height| function|rank|\n", | |
"+------------+------+---------+----+\n", | |
"| Optim'us| 28.0| Leader| 10|\n", | |
"|bumbl#ebéé | 17.5|Espionage| 7|\n", | |
"| ironhide&| 26.0| Security| 7|\n", | |
"+------------+------+---------+----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.rows.sort(\"rank\").show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Selection\n", | |
"\n", | |
"Unlike Pandas, Spark DataFrames don't support random row access. So methods like loc in pandas are not available.\n", | |
"\n", | |
"Also Pandas don't handle indexes. So methods like iloc are not available." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+\n", | |
"| names|\n", | |
"+------------+\n", | |
"|bumbl#ebéé |\n", | |
"| Optim'us|\n", | |
"| ironhide&|\n", | |
"+------------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# Select an show an specific column\n", | |
"df.cols.select(\"names\").show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+\n", | |
"| names|\n", | |
"+------------+\n", | |
"|bumbl#ebéé |\n", | |
"| Optim'us|\n", | |
"| ironhide&|\n", | |
"+------------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# This works with plain Spark too\n", | |
"df.select(\"names\").show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+--------+------+--------+----+\n", | |
"| names|height|function|rank|\n", | |
"+--------+------+--------+----+\n", | |
"|Optim'us| 28.0| Leader| 10|\n", | |
"+--------+------+--------+----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# Select rows from a DataFrame where a the condition is meet\n", | |
"df.rows.select(df[\"rank\"]>7).show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 13, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+--------+------+--------+----+\n", | |
"| names|height|function|rank|\n", | |
"+--------+------+--------+----+\n", | |
"|Optim'us| 28.0| Leader| 10|\n", | |
"+--------+------+--------+----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# This is something like a filter in Spark\n", | |
"df.filter(df[\"rank\"]>7).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"This may seem weird, why we are overriding some functions, but it was needed to add more functionalities that Spark won't handle." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 14, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+------+---------+----+---+\n", | |
"| names|height| function|rank| id|\n", | |
"+------------+------+---------+----+---+\n", | |
"|bumbl#ebéé | 17.5|Espionage| 7| 0|\n", | |
"| Optim'us| 28.0| Leader| 10| 1|\n", | |
"| ironhide&| 26.0| Security| 7| 2|\n", | |
"+------------+------+---------+----+---+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# Creating an id for a DF\n", | |
"df.create_id().show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 15, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+------+---------+----+-------+\n", | |
"| names|height| function|rank|new_col|\n", | |
"+------------+------+---------+----+-------+\n", | |
"|bumbl#ebéé | 17.5|Espionage| 7| 1|\n", | |
"| Optim'us| 28.0| Leader| 10| 1|\n", | |
"| ironhide&| 26.0| Security| 7| 1|\n", | |
"+------------+------+---------+----+-------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# Creating new columns\n", | |
"\n", | |
"df.cols.append(\"new_col\",1).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In plain Spark to do the same you will need to do:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 16, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+------+---------+----+-------+\n", | |
"| names|height| function|rank|new_col|\n", | |
"+------------+------+---------+----+-------+\n", | |
"|bumbl#ebéé | 17.5|Espionage| 7| 1|\n", | |
"| Optim'us| 28.0| Leader| 10| 1|\n", | |
"| ironhide&| 26.0| Security| 7| 1|\n", | |
"+------------+------+---------+----+-------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"from pyspark.sql.functions import lit\n", | |
"df.withColumn(\"new_col\", lit(1)).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Which is not straightforward at all. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Creating a simple plot" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 27, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA1QAAAEHCAYAAACp5ActAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4xLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvDW2N/gAAFTVJREFUeJzt3X+w5WddH/D3h6zR2gARd0FINmzEUE2ZKrimoFZiodMEJctMqU0K1bSMGaEwiiiNYEOK1eGHA4Ux/oiV4YdICFTpti4NtcJQLMFswACbkLINCdkQSAiEkIkQVz7945wNJzd37948Obvn3OzrNXPnnu/zfc55PufMM+ee9/0+3++p7g4AAAD330MWXQAAAMBGJVABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQqAe1TVnqo6fdF1HClVta2quqo2Ddz3pKq6s6qOOZzjALDcBCqAo0RVXV9VT1/Rdm5VfejAdnf//e7+wCEeZ0OHg+lzfvMDfZzu/mx3H9fdfzuHmi6sqj9c0faBoyncAmxUAhUAS2WjBjUAjk4CFQD3mD2KVVWnVdXuqrqjqr5QVa+bdvvg9Pft0yVvT6mqh1TVr1bVDVV1S1W9taoePvO4Pz3dd1tV/fsV41xYVe+uqj+sqjuSnDsd+8NVdXtV3VxVv1VVx848XlfVC6rq01X11ar6tap6XFX9n2m9l872X4fnVNVnq+qLVfXymXEeUlXnV9X/m9Z+aVU9YrrvXkfqqurkqvrgtJ4/q6qLVh51Wm2cqjojycuS/Ivp63nV/agbgAUTqAA4mDckeUN3PyzJ45JcOm3/senv46dL3j6c5Nzpz48n+e4kxyX5rSSpqlOT/HaS5yR5dJKHJzlhxVg7krw7yfFJ3p7kb5O8OMnmJE9J8rQkL1hxn3+a5AeTPDnJS5NcnOS5SbYmeUKSc1Z7Ut395u4+d0Xzjyb5e9NxLqiq75u2vyjJs5I8Ncljknw5yUWrPW6SP0ryl0m+M8mFSf7VKn3uM053/48kv5HkndPX8/undZ5+qOWXACyeQAVwdHnP9KjP7VV1eyZB52D+Jsn3VNXm7r6zuy9fo+9zkryuu6/r7juT/EqSs6dHb56d5L9194e6++4kFyTpFff/cHe/p7u/0d1/3d1Xdvfl3b2/u69P8nuZhJpZr+nuO7p7T5JPJnnfdPyvJHlvkieu7yVJkvyH6bhXJbkqyfdP238uycu7e193fz2ToPTslcsSq+qkJD+U5ILuvru7P5Rk5/0YB4ANSqACOLo8q7uPP/CT+x71mfW8JI9P8qmquqKqfnKNvo9JcsPM9g1JNiV51HTfjQd2dPddSW5bcf8bZzeq6vFV9d+r6vPTZYC/kcnRqllfmLn916tsH7dGvSt9fub2XTP3fWySP5kJoNdkcvTsUSvu/5gkX5o+t1Wf0yHGAWCDEqgAWFV3f7q7z0nyyCSvTvLuqvq7ue/RpST5XCbh44CTkuzPJOTcnOTEAzuq6u9ksizuXsOt2P6dJJ9Kcsp0yeHLktT4sxl2Y5IzZ0Nod39bd9+0ot/NSR5RVd8+07b1foyz2msKwAYgUAGwqqp6blVt6e5vJLl92vyNJLdOf3/3TPd3JHnx9MIMx+Wb5wTtz+TcqGdW1Q9PLxRxYQ4djh6a5I4kd1bV9yZ5/rye1/30u0l+vaoemyRVtaWqdqzs1N03JNmd5MKqOraqnpLkmfdjnC8k2VZV/i4DbDDeuAE4mDOS7KmqOzO5QMXZ0/N/7kry60n+YroU7slJ3pTkbZlcAfAzSb6WyQUdMj3H6UVJLsnkSM6dSW5J8vU1xv6lJP8yyVeT/H6Sd87/6a3LGzI5F+p9VfXVJJcn+YcH6fucTC6gcVuS/5hJzWs9x1nvmv6+rao+Ol4uAEdadVtlAMCRMz2CdXsmy/k+s+h6DpeqemeST3X3KxZdCwCHjyNUABx2VfXMqvr26TlYv5nkE0muX2xV81VVPzT9LqyHTL9bakeS9yy6LgAOL4EKgCNhRyYXrvhcklMyWT74YFsi8V1JPpDJksY3Jnl+d39soRUBcNhZ8gcAADDIESoAAIBBAhUAAMCgTYsaePPmzb1t27ZFDQ8AAHBQV1555Re7e8uh+i0sUG3bti27d+9e1PAAAAAHVVU3rKefJX8AAACDBCoAAIBBAhUAAMAggQoAAGDQIQNVVb2pqm6pqk8eZH9V1Ruram9VfbyqnjT/MgEAAJbPeo5QvTnJGWvsPzPJKdOf85L8zgMvCwAAYPkdMlB19weTfGmNLjuSvLUnLk9yfFU9el4FAgAALKt5nEN1QpIbZ7b3TdsAAAAe1I7oF/tW1XmZLAvMSSeddCSHXpdt5//poktYGte/6icWXQIAsEH5TPVNa32m8jp900b+7DmPI1Q3Jdk6s33itO0+uvvi7t7e3du3bNkyh6EBAAAWZx6BameSn55e7e/JSb7S3TfP4XEBAACW2iGX/FXVO5KcnmRzVe1L8ook35Ik3f27SXYleUaSvUnuSvKvD1exAAAAy+SQgaq7zznE/k7yb+dWEQAAwAYxjyV/AAAARyWBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAoHUFqqo6o6quraq9VXX+KvtPqqr3V9XHqurjVfWM+ZcKAACwXA4ZqKrqmCQXJTkzyalJzqmqU1d0+9Ukl3b3E5OcneS3510oAADAslnPEarTkuzt7uu6++4klyTZsaJPJ3nY9PbDk3xufiUCAAAsp/UEqhOS3DizvW/aNuvCJM+tqn1JdiV50WoPVFXnVdXuqtp96623DpQLAACwPOZ1UYpzkry5u09M8owkb6uq+zx2d1/c3du7e/uWLVvmNDQAAMBirCdQ3ZRk68z2idO2Wc9LcmmSdPeHk3xbks3zKBAAAGBZrSdQXZHklKo6uaqOzeSiEztX9PlskqclSVV9XyaBypo+AADgQe2Qgaq79yd5YZLLklyTydX89lTVK6vqrGm3lyT52aq6Ksk7kpzb3X24igYAAFgGm9bTqbt3ZXKxidm2C2ZuX53kR+ZbGgAAwHKb10UpAAAAjjoCFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAg9YVqKrqjKq6tqr2VtX5B+nzU1V1dVXtqao/mm+ZAAAAy2fToTpU1TFJLkryT5LsS3JFVe3s7qtn+pyS5FeS/Eh3f7mqHnm4CgYAAFgW6zlCdVqSvd19XXffneSSJDtW9PnZJBd195eTpLtvmW+ZAAAAy2c9geqEJDfObO+bts16fJLHV9VfVNXlVXXGag9UVedV1e6q2n3rrbeOVQwAALAk5nVRik1JTklyepJzkvx+VR2/slN3X9zd27t7+5YtW+Y0NAAAwGKsJ1DdlGTrzPaJ07ZZ+5Ls7O6/6e7PJPm/mQQsAACAB631BKorkpxSVSdX1bFJzk6yc0Wf92RydCpVtTmTJYDXzbFOAACApXPIQNXd+5O8MMllSa5Jcml376mqV1bVWdNulyW5raquTvL+JL/c3bcdrqIBAACWwSEvm54k3b0rya4VbRfM3O4kvzj9AQAAOCrM66IUAAAARx2BCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQesKVFV1RlVdW1V7q+r8Nfr9s6rqqto+vxIBAACW0yEDVVUdk+SiJGcmOTXJOVV16ir9Hprk55N8ZN5FAgAALKP1HKE6Lcne7r6uu+9OckmSHav0+7Ukr07ytTnWBwAAsLTWE6hOSHLjzPa+ads9qupJSbZ295/OsTYAAICl9oAvSlFVD0nyuiQvWUff86pqd1XtvvXWWx/o0AAAAAu1nkB1U5KtM9snTtsOeGiSJyT5QFVdn+TJSXaudmGK7r64u7d39/YtW7aMVw0AALAE1hOorkhySlWdXFXHJjk7yc4DO7v7K929ubu3dfe2JJcnOau7dx+WigEAAJbEIQNVd+9P8sIklyW5Jsml3b2nql5ZVWcd7gIBAACW1ab1dOruXUl2rWi74CB9T3/gZQEAACy/B3xRCgAAgKOVQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwKB1BaqqOqOqrq2qvVV1/ir7f7Gqrq6qj1fV/6qqx86/VAAAgOVyyEBVVcckuSjJmUlOTXJOVZ26otvHkmzv7n+Q5N1JXjPvQgEAAJbNeo5QnZZkb3df1913J7kkyY7ZDt39/u6+a7p5eZIT51smAADA8llPoDohyY0z2/umbQfzvCTvfSBFAQAAbASb5vlgVfXcJNuTPPUg+89Lcl6SnHTSSfMcGgAA4IhbzxGqm5Jsndk+cdp2L1X19CQvT3JWd399tQfq7ou7e3t3b9+yZctIvQAAAEtjPYHqiiSnVNXJVXVskrOT7JztUFVPTPJ7mYSpW+ZfJgAAwPI5ZKDq7v1JXpjksiTXJLm0u/dU1Sur6qxpt9cmOS7Ju6rqr6pq50EeDgAA4EFjXedQdfeuJLtWtF0wc/vpc64LAABg6a3ri30BAAC4L4EKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBBAhUAAMAggQoAAGCQQAUAADBIoAIAABgkUAEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGCRQAQAADBKoAAAABglUAAAAgwQqAACAQQIVAADAIIEKAABgkEAFAAAwSKACAAAYJFABAAAMEqgAAAAGCVQAAACDBCoAAIBB6wpUVXVGVV1bVXur6vxV9n9rVb1zuv8jVbVt3oUCAAAsm0MGqqo6JslFSc5McmqSc6rq1BXdnpfky939PUlen+TV8y4UAABg2aznCNVpSfZ293XdfXeSS5LsWNFnR5K3TG+/O8nTqqrmVyYAAMDyWU+gOiHJjTPb+6Ztq/bp7v1JvpLkO+dRIAAAwLLadCQHq6rzkpw33byzqq49kuNvAJuTfHHRRSRJWbS5kSzNvGFDMW8YYd4w4qieNz5Trc8qr9MyzJvHrqfTegLVTUm2zmyfOG1brc++qtqU5OFJblv5QN19cZKL11PY0aiqdnf39kXXwcZi3jDCvGGEecMI84YRG2nerGfJ3xVJTqmqk6vq2CRnJ9m5os/OJD8zvf3sJH/e3T2/MgEAAJbPIY9Qdff+qnphksuSHJPkTd29p6pemWR3d+9M8gdJ3lZVe5N8KZPQBQAA8KC2rnOountXkl0r2i6Yuf21JP98vqUdlSyHZIR5wwjzhhHmDSPMG0ZsmHlTVuYBAACMWc85VAAAAKxCoFqQqnpTVd1SVZ+cafuBqrq8qv6qqnZX1WmLrJHlUlVbq+r9VXV1Ve2pqp+ftj+iqv5nVX16+vs7Fl0ry2ONefPaqvpUVX28qv6kqo5fdK0sj4PNm5n9L6mqrqrNi6qR5bPWvKmqF03fc/ZU1WsWWSfLZY2/Uxvmc7ElfwtSVT+W5M4kb+3uJ0zb3pfk9d393qp6RpKXdvfpCyyTJVJVj07y6O7+aFU9NMmVSZ6V5NwkX+ruV1XV+Um+o7v/3QJLZYmsMW9OzOSKrPurJt/+Yd5wwMHmTXdfXVVbk/znJN+b5Ae7e9HfE8OSWOP95lFJXp7kJ7r761X1yO6+ZZG1sjzWmDf/KRvkc7EjVAvS3R/M5IqI92pO8rDp7Ycn+dwRLYql1t03d/dHp7e/muSaJCck2ZHkLdNub8nkTQiSHHzedPf7unv/tNvlmQQsSLLm+02SvD7JSzP5mwX3WGPePD/Jq7r769N9whT3WGPebJjPxeu6yh9HzC8kuayqfjOTsPvDC66HJVVV25I8MclHkjyqu2+e7vp8Jv8JhPtYMW9m/Zsk7zzS9bAxzM6bqtqR5KbuvqqqFloXy23F+81rk/yjqvr1JF9L8kvdfcXiqmNZrZg3G+ZzsSNUy+X5SV7c3VuTvDiT7/eCe6mq45L8lyS/0N13zO6bfqG2/xpzHwebN1X18iT7k7x9UbWxvGbnTSbz5GVJLljzThz1Vnm/2ZTkEUmenOSXk1xaEjkrrDJvNsznYoFqufxMkj+e3n5XkqU9+Y7FqKpvyeTN5u3dfWCufGG6/vjAOmRLKbiXg8ybVNW5SX4yyXPaCbWssMq8eVySk5NcVVXXZ7JM9KNV9V2Lq5Jlc5D3m31J/rgn/jLJN5K4oAn3OMi82TCfiwWq5fK5JE+d3v7HST69wFpYMtP/5v1Bkmu6+3Uzu3Zm8qaT6e//eqRrY3kdbN5U1RmZnAdzVnfftaj6WE6rzZvu/kR3P7K7t3X3tkw+JD+puz+/wFJZImv8nXpPkh+f9nl8kmOTuJgJSdacNxvmc7Gr/C1IVb0jyemZ/IfmC0lekeTaJG/I5ND415K8oLuvXFSNLJeq+tEk/zvJJzL5714yWX7zkSSXJjkpyQ1Jfqq7V17whKPUGvPmjUm+Nclt07bLu/vnjnyFLKODzZvu3jXT5/ok213ljwPWeL/5syRvSvIDSe7O5ByqP19IkSydNebNHdkgn4sFKgAAgEGW/AEAAAwSqAAAAAYJVAAAAIMEKgAAgEECFQAAwCCBCgAAYJBABQAAMEigAgAAGPT/AcRQsAOVXnEbAAAAAElFTkSuQmCC\n", | |
"text/plain": [ | |
"<Figure size 864x360 with 1 Axes>" | |
] | |
}, | |
"metadata": { | |
"needs_background": "light" | |
}, | |
"output_type": "display_data" | |
} | |
], | |
"source": [ | |
"df.plot.hist(\"height\",10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Yes now you can plot spark DF that easy. " | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Reading external data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 43, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n", | |
"| id| firstName| lastName|billingId| product|price| birth|dummyCol|\n", | |
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n", | |
"| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|\n", | |
"| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|\n", | |
"| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|\n", | |
"| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|\n", | |
"| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|\n", | |
"| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|\n", | |
"| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|\n", | |
"| 8| David| H$$$ilbert| 624| taaaccoo| 3|1950/07/14| let|\n", | |
"| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|\n", | |
"| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|\n", | |
"| 11| Isaac| Newton| 992| pasta| 9|1999/02/15| never |\n", | |
"| 12| Emmy%%| Nöether$| 234| pasta| 9|1993/12/08| gonna|\n", | |
"| 13| Max!!!| Planck!!!| 111|hamburguer| 4|1994/01/04| run |\n", | |
"| 14| Fred| Hoy&&&le| 553| pizzza| 8|1997/06/27| around|\n", | |
"| 15|((( Heinrich )))))| Hertz| 116| pizza| 8|1956/11/30| and|\n", | |
"| 16| William| Gilbert###| 886| BEER| 2|1958/03/26| desert|\n", | |
"| 17| Marie| CURIE| 912| Rice| 1|2000/03/22| you|\n", | |
"| 18| Arthur| COM%%%pton| 812| 110790| 5|1899/01/01| #|\n", | |
"| 19| JAMES| Chadwick| 467| null| 10|1921/05/03| #|\n", | |
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df_csv = op.load.csv(\"foo.csv\", header=True)\n", | |
"df_csv.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Reading data from the Web" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n", | |
"| id| firstName| lastName|billingId| product|price| birth|dummyCol|\n", | |
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n", | |
"| 1| Luis| Alvarez$$%!| 123| Cake| 10|1980/07/07| never|\n", | |
"| 2| André| Ampère| 423| piza| 8|1950/07/08| gonna|\n", | |
"| 3| NiELS| Böhr//((%%| 551| pizza| 8|1990/07/09| give|\n", | |
"| 4| PAUL| dirac$| 521| pizza| 8|1954/07/10| you|\n", | |
"| 5| Albert| Einstein| 634| pizza| 8|1990/07/11| up|\n", | |
"| 6| Galileo| GALiLEI| 672| arepa| 5|1930/08/12| never|\n", | |
"| 7| CaRL| Ga%%%uss| 323| taco| 3|1970/07/13| gonna|\n", | |
"| 8| David| H$$$ilbert| 624| taaaccoo| 3|1950/07/14| let|\n", | |
"| 9| Johannes| KEPLER| 735| taco| 3|1920/04/22| you|\n", | |
"| 10| JaMES| M$$ax%%well| 875| taco| 3|1923/03/12| down|\n", | |
"| 11| Isaac| Newton| 992| pasta| 9|1999/02/15| never |\n", | |
"| 12| Emmy%%| Nöether$| 234| pasta| 9|1993/12/08| gonna|\n", | |
"| 13| Max!!!| Planck!!!| 111|hamburguer| 4|1994/01/04| run |\n", | |
"| 14| Fred| Hoy&&&le| 553| pizzza| 8|1997/06/27| around|\n", | |
"| 15|((( Heinrich )))))| Hertz| 116| pizza| 8|1956/11/30| and|\n", | |
"| 16| William| Gilbert###| 886| BEER| 2|1958/03/26| desert|\n", | |
"| 17| Marie| CURIE| 912| Rice| 1|2000/03/22| you|\n", | |
"| 18| Arthur| COM%%%pton| 812| 110790| 5|1899/01/01| #|\n", | |
"| 19| JAMES| Chadwick| 467| null| 10|1921/05/03| #|\n", | |
"+---+--------------------+--------------------+---------+----------+-----+----------+--------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df_url =op.load.url(\"https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/data/foo.csv\")\n", | |
"df_url.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Joins, merge, concat, etc." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Optimus provides and intuitive way to concat Dataframes by columns or rows. Operations like `join` and `group` are handle using Spark directly" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+------+---------+----+\n", | |
"| names|height| function|rank|\n", | |
"+------------+------+---------+----+\n", | |
"|bumbl#ebéé | 17.5|Espionage| 7|\n", | |
"| Optim'us| 28.0| Leader| 10|\n", | |
"| ironhide&| 26.0| Security| 7|\n", | |
"+------------+------+---------+----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# We have this dataframe\n", | |
"df.show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 40, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Let's add a new column\n", | |
"df_new = op.create.df(\n", | |
" [\n", | |
" \"try\"\n", | |
" ],\n", | |
" [\n", | |
" (\"one\"),\n", | |
" (\"two\"),\n", | |
" (\"three\") \n", | |
" ])\n", | |
"\n", | |
"df_new = op.append([df,df_new], \"columns\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 41, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+------+---------+----+-----+\n", | |
"| names|height| function|rank| try|\n", | |
"+------------+------+---------+----+-----+\n", | |
"|bumbl#ebéé | 17.5|Espionage| 7| one|\n", | |
"| Optim'us| 28.0| Leader| 10| two|\n", | |
"| ironhide&| 26.0| Security| 7|three|\n", | |
"+------------+------+---------+----+-----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df_new.show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 37, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"root\n", | |
" |-- names: string (nullable = true)\n", | |
" |-- height: double (nullable = true)\n", | |
" |-- function: string (nullable = true)\n", | |
" |-- rank: long (nullable = true)\n", | |
" |-- try: string (nullable = true)\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df_new.printSchema()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 38, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+------------+------+---------+----+-------+\n", | |
"| names|height| function|rank| try|\n", | |
"+------------+------+---------+----+-------+\n", | |
"|bumbl#ebéé | 17.5|Espionage| 7| one|\n", | |
"| Optim'us| 28.0| Leader| 10| two|\n", | |
"| ironhide&| 26.0| Security| 7| three|\n", | |
"| favio| 10.0| Leader| 1|new_row|\n", | |
"+------------+------+---------+----+-------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"# Let's add a new row\n", | |
"\n", | |
"new_row = [(\"favio\", 10, \"Leader\", 1, \"new_row\")]\n", | |
"df_new = df_new.rows.append(new_row).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Doing fun stuff with columns and rows" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 42, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Let's create a dataframe :)\n", | |
"\n", | |
"\n", | |
"from pyspark.sql.types import StructType, StructField, StringType, BooleanType, IntegerType, ArrayType\n", | |
"\n", | |
"df = op.create.df(\n", | |
" [\n", | |
" (\"words\", \"str\", True),\n", | |
" (\"num\", \"int\", True),\n", | |
" (\"animals\", \"str\", True),\n", | |
" (\"thing\", StringType(), True),\n", | |
" (\"two strings\", StringType(), True),\n", | |
" (\"filter\", StringType(), True),\n", | |
" (\"num 2\", \"string\", True),\n", | |
" (\"col_array\", ArrayType(StringType()), True),\n", | |
" (\"col_int\", ArrayType(IntegerType()), True)\n", | |
"\n", | |
" ]\n", | |
",\n", | |
"[\n", | |
" (\" I like fish \", 1, \"dog\", \"housé\", \"cat-car\", \"a\",\"1\",[\"baby\", \"sorry\"],[1,2,3]),\n", | |
" (\" zombies\", 2, \"cat\", \"tv\", \"dog-tv\", \"b\",\"2\",[\"baby 1\", \"sorry 1\"],[3,4]),\n", | |
" (\"simpsons cat lady\", 2, \"frog\", \"table\",\"eagle-tv-plus\",\"1\",\"3\", [\"baby 2\", \"sorry 2\"], [5,6,7]),\n", | |
" (None, 3, \"eagle\", \"glass\", \"lion-pc\", \"c\",\"4\", [\"baby 3\", \"sorry 3\"] ,[7,8])\n", | |
" ])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 45, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| words|num|animals|thing| two strings|filter|num 2| col_array| col_int|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n", | |
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n", | |
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n", | |
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"With Optimus you can select columns with index or name, and even combined" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------------+---+-------+-----+\n", | |
"| words|num|animals|thing|\n", | |
"+-------------------+---+-------+-----+\n", | |
"| I like fish | 1| dog|housé|\n", | |
"| zombies| 2| cat| tv|\n", | |
"|simpsons cat lady| 2| frog|table|\n", | |
"| null| 3| eagle|glass|\n", | |
"+-------------------+---+-------+-----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"columns = [\"words\", 1, \"animals\", 3]\n", | |
"df.cols.select(columns).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Select columns with a Regex" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+---+-----+\n", | |
"|num|num 2|\n", | |
"+---+-----+\n", | |
"| 1| 1|\n", | |
"| 2| 2|\n", | |
"| 2| 3|\n", | |
"| 3| 4|\n", | |
"+---+-----+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.select(\"n.*\", regex = True).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Select all the columns of type string" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-----+-------------------+-------------+------+-----+-------+\n", | |
"|num 2| words| two strings|filter|thing|animals|\n", | |
"+-----+-------------------+-------------+------+-----+-------+\n", | |
"| 1| I like fish | cat-car| a|housé| dog|\n", | |
"| 2| zombies| dog-tv| b| tv| cat|\n", | |
"| 3|simpsons cat lady|eagle-tv-plus| 1|table| frog|\n", | |
"| 4| null| lion-pc| c|glass| eagle|\n", | |
"+-----+-------------------+-------------+------+-----+-------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.select(\"*\", data_type = \"str\").show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Rename a columun" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| words|number|animals|thing| two strings|filter|num 2| col_array| col_int|\n", | |
"+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n", | |
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n", | |
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n", | |
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n", | |
"+-------------------+------+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.rename('num','number').show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Rename multiple columns and uppercase all the columns¶" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| WORDS|NUM|ANIMALS|THING| TWO STRINGS|FILTER|NUM 2| COL_ARRAY| COL_INT|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n", | |
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n", | |
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n", | |
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.rename([('num','number'),(\"animals\",\"gods\")], str.upper).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Convert to uppercase" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| WORDS|NUM|ANIMALS|THING| TWO STRINGS|FILTER|NUM 2| COL_ARRAY| COL_INT|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n", | |
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n", | |
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n", | |
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.rename(str.upper).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Sort columns in alphabetical order" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 53, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n", | |
"|animals| col_array| col_int|filter|num|num 2|thing| two strings| words|\n", | |
"+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n", | |
"| dog| [baby, sorry]|[1, 2, 3]| a| 1| 1|housé| cat-car| I like fish |\n", | |
"| cat|[baby 1, sorry 1]| [3, 4]| b| 2| 2| tv| dog-tv| zombies|\n", | |
"| frog|[baby 2, sorry 2]|[5, 6, 7]| 1| 2| 3|table|eagle-tv-plus|simpsons cat lady|\n", | |
"| eagle|[baby 3, sorry 3]| [7, 8]| c| 3| 4|glass| lion-pc| null|\n", | |
"+-------+-----------------+---------+------+---+-----+-----+-------------+-------------------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.sort(order = \"asc\").show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Drop multiple columns" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 54, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"|animals|thing| two strings|filter|num 2| col_array| col_int|\n", | |
"+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n", | |
"| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n", | |
"| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n", | |
"| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n", | |
"+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.cols.drop([\"num\",\"words\"]).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Sort by row values" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 56, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| words|num|animals|thing| two strings|filter|num 2| col_array| col_int|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n", | |
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n", | |
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n", | |
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.rows.sort(\"animals\").show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Select by row" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 57, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n", | |
"| words|num|animals|thing|two strings|filter|num 2| col_array| col_int|\n", | |
"+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n", | |
"| I like fish | 1| dog|housé| cat-car| a| 1|[baby, sorry]|[1, 2, 3]|\n", | |
"+-------------------+---+-------+-----+-----------+------+-----+-------------+---------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"df.rows.select(df[\"num\"]==1).show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Chaining\n", | |
"\n", | |
".cols y .rows attributes are used to organize and encapsulate optimus functionality apart from Apache Spark Dataframe API.\n", | |
"\n", | |
"At the same time it can be helpfull when you look at the code because every line is self explained.\n", | |
"\n", | |
"The past transformations were done step by step, but this can be achieved by chaining all operations into one line of code, like the cell below. This way is much more efficient and scalable because it uses all optimization issues from the lazy evaluation approach." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 55, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Orginal DF\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| words|num|animals|thing| two strings|filter|num 2| col_array| col_int|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"| I like fish | 1| dog|housé| cat-car| a| 1| [baby, sorry]|[1, 2, 3]|\n", | |
"| zombies| 2| cat| tv| dog-tv| b| 2|[baby 1, sorry 1]| [3, 4]|\n", | |
"|simpsons cat lady| 2| frog|table|eagle-tv-plus| 1| 3|[baby 2, sorry 2]|[5, 6, 7]|\n", | |
"| null| 3| eagle|glass| lion-pc| c| 4|[baby 3, sorry 3]| [7, 8]|\n", | |
"+-------------------+---+-------+-----+-------------+------+-----+-----------------+---------+\n", | |
"\n", | |
"New DF\n", | |
"+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n", | |
"|two strings|thing|num 2|new_col_2|new_col_1|filter| col_int| col_array|animals|\n", | |
"+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n", | |
"| cat-car|housé| 1|spongebob| 1| a|[1, 2, 3]| [baby, sorry]| dog|\n", | |
"| dog-tv| tv| 2|spongebob| 1| b| [3, 4]|[baby 1, sorry 1]| cat|\n", | |
"| lion-pc|glass| 4|spongebob| 1| c| [7, 8]|[baby 3, sorry 3]| eagle|\n", | |
"+-----------+-----+-----+---------+---------+------+---------+-----------------+-------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"print(\"Orginal DF\")\n", | |
"df.show()\n", | |
"\n", | |
"print(\"New DF\")\n", | |
"df\\\n", | |
" .cols.rename([('num','number')])\\\n", | |
" .cols.drop([\"number\",\"words\"])\\\n", | |
" .withColumn(\"new_col_2\", lit(\"spongebob\"))\\\n", | |
" .cols.append(\"new_col_1\", 1)\\\n", | |
" .cols.sort(order= \"desc\")\\\n", | |
" .rows.drop(df[\"num 2\"] == 3)\\\n", | |
" .show()" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.5.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment