Here I will show the techniques I use to manipulate data using Python. These techniques might change if I find a more efficient way of performing a task.

Loading and Dumping Data

Magic Command

%matplotlib notebook  # interactive render
%matplotlib inline    # create plot instantly
%matplotlib gtk       # create plot in new window

Import Useful Packages

import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import sys
from sqlalchemy import create_engine
import sqlite3
from plyfile import PlyData, PlyElement # install using pip install plyfile

Creating Connections and Setting Aesthetics

connection = "database_name"
db = sqlite3.connect(connection)
matplotlib.style.use('ggplot') # I love the ggplot look: the grammar of graphics
  • If the dataset is not in the current folder of the jupyter notebook, the path of the dataset is needed.
dataset_path = r'C:\dataset_path\...' # raw string avoids backslash escapes in Windows paths
if dataset_path not in sys.path:
  sys.path.append(dataset_path)
  • I always load the data using pandas. It is just the most intuitive way to do it.
path = ""
df = pd.read_excel(path + '\\file_name.ext', sheet_name='sheetname', na_values=['NA', '?'])
df = pd.read_csv(path + '\\file_name.ext', sep=',', index_col=0) # index_col=0 uses the first column as the index
df = pd.read_html('url')
df = pd.read_json(path + '\\file_name.ext', orient='columns')
df = pd.read_sql_query('SELECT * FROM table_name', db)
plyfile = PlyData.read('folder/filename')
db.close() # close the sqlite3 connection when done
  • I dump the data using pandas as well.
df.to_excel('file_name.ext')
df.to_json('file_name.ext')
df.to_csv('file_name.ext')
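  • The create_engine import above is not used in the snippets; as a sketch (assuming a local SQLite file named example.db and a table named table_name), the same dataframe can be dumped to and reloaded from SQL.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///example.db')                     # hypothetical SQLite file
df.to_sql('table_name', engine, if_exists='replace', index=False)  # dump the dataframe
df_back = pd.read_sql_query('SELECT * FROM table_name', engine)    # reload it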

Initial Exploration

Navigating Path and Folders

import os
os.getcwd()               # gets the current path
os.listdir()              # gets the list of folders in current path
os.listdir("folder_name") # gets the list of files or folders in the stated folder

Available Methods

  • Sometimes we would like to see the available methods, either to explore or because we forgot the correct spelling.
df. # Press the tab key to see all available methods for the dataframe object
df.co # Pressing tab on this one will only show methods starting with "co"

Renaming Columns & Lowering Characters

  • I explore the available features. If I need to change names, I rename them and then convert all metadata to lowercase.
# show the columns
df.columns

# change name of columns by index
df.columns = ['new', 'name', 'for', 'columns']

# change name of specific columns
df.rename(index=str, columns={"col1": "colnew1", "col2": "colnew2"}) # returns a new frame; assign it or pass inplace=True

# convert all characters to lower
df.columns = df.columns.str.lower()

Show all Data Types

  • Then I check the data types loaded.
df.dtypes
  • I see the data available in the frame.
df.head(10) # or df.tail(10)
  • Then I check the descriptive statistics of quantitative values to have a sense of the distribution.
df.describe()
df.mean(axis=0)
df.col1.value_counts() # value_counts() works on a Series, so call it on a column

Lower All Metadata

df.columns = [x.lower() for x in df.columns]
df.columns = map(lambda x: x.lower(), df.columns)

Identifying Null Values

  • I identify null values by using the isnull() method
df.isnull().sum()
  • To get the total sum of null values in the dataset, we just add another sum method.
df.isnull().sum().sum()
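  • A quick extra check: since booleans average to a proportion, the mean of isnull() gives the fraction of nulls per column.
df.isnull().mean() # fraction of null values in each column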

Clean Data

Dates

  • When cleaning dates, I use the .to_datetime() method with errors='coerce'. Then I save the result to another column and inspect the rows with errors by slicing the dataframe and showing only the null values. After validating, I reassign.
df['date_ver'] = pd.to_datetime(df.date, errors='coerce')
df[['date', 'date_ver']][df.date_ver.isnull()]

# after validating or cleaning the data
df.date = df.date_ver

Numeric

df['col1_ver'] = pd.to_numeric(df.col1, errors='coerce')
df[df.col1_ver.isnull()]
# looking for a way to show only those rows with any true condition
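  • One way to do that, assuming several verification columns such as col1_ver and col2_ver, is to combine the null checks with any(axis=1).
ver_cols = ['col1_ver', 'col2_ver']       # hypothetical verification columns
df[df[ver_cols].isnull().any(axis=1)]     # rows where any verification failed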

Nulls

  • After running the isnull().sum() method combination, I run a few validation checks. I start by showing all rows, filtered by the column with the most null values, to compare. After finding patterns, I go directly to the affected rows.
# showing all rows affected
df[df.column_most_null.isnull()]

# showing targeted rows
df[['col1', 'col2', 'col3']][df.colaffected.isnull()]

# replace nan
df.fillna(0)
df.fillna(df.mean(axis=0))
df.fillna(method='ffill')
df.fillna(method='bfill')

# drop rows with nan
df.dropna(axis=0)
df.dropna(axis=0, thresh=4) # keep only rows with at least 4 non-null values

# drop columns with nan
df.dropna(axis=1)

# drop specific rows
drop_this = df[df.col1.isnull()].index
df.drop(drop_this, inplace=True)

# Reset index after dropping
df.reset_index()
df.reset_index(drop=True, inplace=True)

Columns & Rows

  • If there are extra columns or rows, I use drop for the extra columns and slicing for the extra rows.
# deleting extra columns (last 5 in this case)
df = df.drop(df.columns[-5:], axis=1)

# deleting extra rows (last 1 in this case)
df = df[:-1]

Finding Characters & Setting Conditions

  • To find characters, I use the contains() method and create conditions to query only rows that meet those conditions.
df.col1.str.contains(' ')
condition = (df.col1.str.contains(' ')) & (df.col2.isnull()) # always encapsulate in parentheses to avoid unwanted errors
df.col1[condition]

Replacing Values in Dataframe

df.loc[condition, 'col1'] = new_value

Splitting Values on Condition

# Create series of lists
df.col1[condition].apply(lambda x: x.split(' '))
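  • A vectorized alternative (a sketch using the .str accessor rather than apply) spreads the split pieces into separate columns.
split_cols = df.loc[condition, 'col1'].str.split(' ', expand=True) # one column per piece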

Replace Non-numbers

import re
df.col1.apply(lambda x: re.sub('[^0-9]', '', str(x)))

Removing Duplicates

df.drop_duplicates(subset=['col1', 'col2', 'col3'], inplace=True)

Slicing Target or Dependent Variable

  • When slicing just the target, I use series slicing: each of the first four lines returns a Series, while the dataframe-style slicing below returns a DataFrame.
# Series Slicing
df.col1
df['col1']
df.loc[:, 'col1']
df.iloc[:, 0]
# DataFrame Slicing
df[['col1']]
df.loc[:, ['col1']]
df.iloc[:, [0]]
  • After slicing the target, I take it out of the dataframe; pop() removes the column and returns it.
df.pop('col1')
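  • A minimal sketch of the full split, assuming 'col1' is the target column.
y = df.pop('col1')   # target as a Series; 'col1' is no longer in df
X = df               # remaining columns are the features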

Encoding Qualitative Features

  • To encode my target, I use the astype().cat.codes method.
df['feature_name_ec'] = df.feature_name.astype("category").cat.codes
  • When the categorical feature is ordinal in nature, I use the ordered and categories parameters.
ordered_features = ['ord1', 'ord2', 'ord3']
df['feature_name_ec'] = df.feature_name.astype("category", ordered=True, categories=ordered_features).cat.codes
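  • In newer pandas the categories and ordered arguments to astype() were removed, so as a sketch, the same ordinal encoding can be done with pd.CategoricalDtype.
ordered_features = ['ord1', 'ord2', 'ord3']
cat_type = pd.CategoricalDtype(categories=ordered_features, ordered=True)
df['feature_name_ec'] = df.feature_name.astype(cat_type).cat.codes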

Using Bag of Words

  • If I am handling text, I use the CountVectorizer class with the fit_transform(), get_feature_names(), and toarray() methods.
from sklearn.feature_extraction.text import CountVectorizer
# set the object
bag = CountVectorizer()

# get the count of unique words per document
bag_count = bag.fit_transform(words_list).toarray()

# get the unique words (available only after fitting)
bag.get_feature_names()
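  • A small end-to-end sketch with a made-up words_list, to show what the pieces return (counts are per document, columns follow the sorted vocabulary).
from sklearn.feature_extraction.text import CountVectorizer

words_list = ["the cat sat", "the dog sat", "the cat and the dog"]  # hypothetical documents
bag = CountVectorizer()
bag_count = bag.fit_transform(words_list).toarray()

print(bag.get_feature_names())  # ['and', 'cat', 'dog', 'sat', 'the'] (get_feature_names_out() in newer scikit-learn)
print(bag_count.shape)          # (3 documents, 5 unique words)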

Using Stop Words

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

sw = stopwords.words("english")      # requires nltk.download('stopwords') the first time
stemmer = SnowballStemmer("english")
stemmer.stem("word")