Skip to content

Instantly share code, notes, and snippets.

@woobe
Created September 11, 2018 19:27
Show Gist options
  • Save woobe/dc16a307e4323d9623aee4749856c725 to your computer and use it in GitHub Desktop.
Save woobe/dc16a307e4323d9623aee4749856c725 to your computer and use it in GitHub Desktop.
Example notebook for Airline Sentiment using Driverless AI's Python API
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Driverless AI NLP Demo - Airline Sentiment Dataset ###\n",
"\n",
"In this notebook, we will see how to use Driverless AI python client to build text classification models using the Airline sentiment twitter dataset.\n",
"\n",
"Import the necessary python modules to get started including the Driverless AI client. If not already installed, please download the python client from Driverless AI GUI and install the same.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import h2oai_client\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import model_selection\n",
"from h2oai_client import Client"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The below code downloads the twitter airline sentiment dataset and save it in the current folder. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2018-09-11 12:12:42-- https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv\n",
"Resolving www.figure-eight.com (www.figure-eight.com)... 52.0.208.137, 52.3.39.167\n",
"Connecting to www.figure-eight.com (www.figure-eight.com)|52.0.208.137|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 3704908 (3.5M) [application/octet-stream]\n",
"Saving to: ‘Airline-Sentiment-2-w-AA.csv’\n",
"\n",
"Airline-Sentiment-2 100%[===================>] 3.53M 660KB/s in 5.5s \n",
"\n",
"2018-09-11 12:12:50 (660 KB/s) - ‘Airline-Sentiment-2-w-AA.csv’ saved [3704908/3704908]\n",
"\n"
]
}
],
"source": [
"! wget https://www.figure-eight.com/wp-content/uploads/2016/03/Airline-Sentiment-2-w-AA.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now split the dataset into train and test files so as to build models. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"al = pd.read_csv(\"Airline-Sentiment-2-w-AA.csv\", encoding='ISO-8859-1')\n",
"train_al, test_al = model_selection.train_test_split(al, test_size=0.2, random_state=2018)\n",
"train_al.to_csv(\"train_airline_sentiment.csv\", index=False)\n",
"test_al.to_csv(\"test_airline_sentiment.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step is to establish a connection to Driverless AI using `Client`. Please key in your credentials and the url address."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"h2o = Client(address='http://localhost:12345', username='h2oai', password='h2oai')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the train and test files into Driverless AI using the `create_dataset_sync` command."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"train_path = './train_airline_sentiment.csv'\n",
"test_path = './test_airline_sentiment.csv'\n",
"\n",
"train = h2o.create_dataset_sync(train_path)\n",
"test = h2o.create_dataset_sync(test_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us look at some basic information about the dataset. To check the number of columns and rows in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Train Dataset: 20 x 11712\n",
"Test Dataset: 20 x 2928\n"
]
}
],
"source": [
"print('Train Dataset: ', len(train.columns), 'x', train.row_count)\n",
"print('Test Dataset: ', len(test.columns), 'x', test.row_count)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the names of the columns in the training set."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['_unit_id',\n",
" '_golden',\n",
" '_unit_state',\n",
" '_trusted_judgments',\n",
" '_last_judgment_at',\n",
" 'airline_sentiment',\n",
" 'airline_sentiment:confidence',\n",
" 'negativereason',\n",
" 'negativereason:confidence',\n",
" 'airline',\n",
" 'airline_sentiment_gold',\n",
" 'name',\n",
" 'negativereason_gold',\n",
" 'retweet_count',\n",
" 'text',\n",
" 'tweet_coord',\n",
" 'tweet_created',\n",
" 'tweet_id',\n",
" 'tweet_location',\n",
" 'user_timezone']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[c.name for c in train.columns]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We just need two columns for our experiment. `text` which contains the text of the tweet and `airline_sentiment` which contains the sentiment of the tweet (target column). We can drop the remaining columns for this experiment. Let us get a preview for the same.\n",
"\n",
"Also please set enable the tensorflow models by setting `enable_tensorflow=\"on\"` if you have a GPU. This will help in creating the CNN based text features."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ACCURACY [6/10]:',\n",
" '- Training data size: *11,712 rows, 2 cols*',\n",
" '- Feature evolution: *XGBoost*, *time-based validation*',\n",
" '- Final pipeline: *XGBoost*',\n",
" '',\n",
" 'TIME [4/10]:',\n",
" '- Feature evolution: *4 individuals*, up to *52 iterations*',\n",
" '- Early stopping: After *5* iterations of no improvement',\n",
" '',\n",
" 'INTERPRETABILITY [8/10]:',\n",
" '- Feature pre-pruning strategy: FS',\n",
" '- Monotonicity constraints: enabled',\n",
" '- Feature engineering search space (where applicable): [Date, Identity, Interactions, Lags, Text, TextCNN, WeightOfEvidence]',\n",
" '',\n",
" 'XGBoost models to train:',\n",
" '- Model and feature tuning: *72*',\n",
" '- Feature evolution: *252*',\n",
" '- Final pipeline: *1*',\n",
" '',\n",
" 'Estimated max. total memory usage:',\n",
" '- Feature engineering: *144.0MB*',\n",
" '- GPU XGBoost: *8.0MB*']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"exp_preview = h2o.get_experiment_preview_sync(\n",
" dataset_key=train.key,\n",
" validset_key='',\n",
" target_col='airline_sentiment',\n",
" classification=True,\n",
" dropped_cols=[\"_unit_id\", \"_golden\", \"_unit_state\", \"_trusted_judgments\", \"_last_judgment_at\",\n",
" \"airline_sentiment:confidence\", \"negativereason\", \"negativereason:confidence\", \"airline\",\n",
" \"airline_sentiment_gold\", \"name\", \"negativereason_gold\", \"retweet_count\", \n",
" \"tweet_coord\", \"tweet_created\", \"tweet_id\", \"tweet_location\", \"user_timezone\"],\n",
" accuracy=6,\n",
" time=4,\n",
" interpretability=8,\n",
" time_col='',\n",
" enable_gpus=True,\n",
" config_overrides='enable_tensorflow=\"on\"'\n",
")\n",
"exp_preview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please note that the `Text` and `TextCNN` features are enabled for this experiment.\n",
"\n",
"Now we can start the experiment."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [],
"source": [
"model = h2o.start_experiment_sync(\n",
" dataset_key=train.key,\n",
" testset_key=test.key,\n",
" target_col='airline_sentiment',\n",
" scorer=None,\n",
" is_classification=True,\n",
" cols_to_drop=[\"_unit_id\", \"_golden\", \"_unit_state\", \"_trusted_judgments\", \"_last_judgment_at\",\n",
" \"airline_sentiment:confidence\", \"negativereason\", \"negativereason:confidence\", \"airline\",\n",
" \"airline_sentiment_gold\", \"name\", \"negativereason_gold\", \"retweet_count\", \n",
" \"tweet_coord\", \"tweet_created\", \"tweet_id\", \"tweet_location\", \"user_timezone\"],\n",
" accuracy=6,\n",
" time=2,\n",
" interpretability=8,\n",
" time_col='',\n",
" enable_gpus=True,\n",
" config_overrides='enable_tensorflow=\"on\"'\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Modeling completed for model pakimeto\n"
]
}
],
"source": [
"print('Modeling completed for model ' + model.key)"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Logs available at h2oai_experiment_pakimeto/h2oai_experiment_logs_pakimeto.zip\n"
]
}
],
"source": [
"print('Logs available at', model.log_file_path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can download the predictions to the current folder."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Test set predictions available at ./test_preds.csv\n"
]
}
],
"source": [
"test_preds = h2o.download(model.test_predictions_path, '.')\n",
"print('Test set predictions available at', test_preds)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment