Here I will show the techniques I use to manipulate data using Python. These techniques might change if I find a more efficient way of performing a task.

Loading and Dumping Data

Magic Command

%matplotlib notebook  # interactive render
%matplotlib inline    # create plot instantly
%matplotlib gtk       # create plot in new window

Import Useful Packages

import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import sys
from sqlalchemy import create_engine
import sqlite3
from plyfile import PlyData, PlyElement # install using pip install plyfile

Creating Connections and Setting Aesthetics

connection = "database_name"
db = sqlite3.connect(connection)
matplotlib.style.use('ggplot') # I love the ggplot look: the grammar of graphics
  • If the dataset is not in the current folder of the jupyter notebook, the path of the dataset is needed.
dataset_path = r'C:\dataset_path\...' # raw string avoids backslash escapes in Windows paths
if dataset_path not in sys.path:
  sys.path.append(dataset_path)
  • I always load the data using pandas. It is just the most intuitive way to do it.
path = ""
df = pd.read_excel(path + '\\file_name.ext', sheet_name='sheetname', na_values=['NA', '?'])
df = pd.read_csv(path + '\\file_name.ext', sep=',', index_col=0) # index_col=0 uses the first column as the index
df = pd.read_html('url')
df = pd.read_json(path + '\\file_name.ext', orient='columns')
df = pd.read_sql_query('SELECT * FROM table_name', db)
plyfile = PlyData.read('folder/filename')
db.close() # close the sqlite3 connection when done
  • I dump the data using pandas as well.
df.to_excel('file_name.ext')
df.to_json('file_name.ext')
df.to_csv('file_name.ext')
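  • The create_engine import above is not used in the snippets; as a sketch (assuming a local SQLite file named example.db and a table named table_name), the same dataframe can be dumped to and reloaded from SQL.
from sqlalchemy import create_engine
engine = create_engine('sqlite:///example.db')                     # hypothetical SQLite file
df.to_sql('table_name', engine, if_exists='replace', index=False)  # dump the dataframe
df_back = pd.read_sql_query('SELECT * FROM table_name', engine)    # reload it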

Initial Exploration

Navigating Path and Folders

import os
os.getcwd()               # gets the current path
os.listdir()              # gets the list of folders in current path
os.listdir("folder_name") # gets the list of files or folders in the stated folder

Available Methods

  • Sometimes we would like to see the available methods, either to explore or because we forgot the correct spelling.
df. # Press the tab key to see all available methods for the dataframe object
df.co # Pressing tab on this one will only show methods starting with "co"

Renaming Columns & Lowering Characters

  • I explore the available features. If I need to change names, I rename them and then convert all metadata to lowercase.
# show the columns
df.columns

# change name of columns by index
df.columns = ['new', 'name', 'for', 'columns']

# change name of specific columns
df.rename(index=str, columns={"col1": "colnew1", "col2": "colnew2"}) # returns a new frame; assign it or pass inplace=True

# convert all characters to lower
df.columns = df.columns.str.lower()

Show all Data Types

  • Then I check the data types loaded.
df.dtypes
  • I see the data available in the frame.
df.head(10) # or df.tail(10)
  • Then I check the descriptive statistics of quantitative values to have a sense of the distribution.
df.describe()
df.mean(axis=0)
df.col1.value_counts() # value_counts() works on a Series, so call it on a column

Lower All Metadata

df.columns = [x.lower() for x in df.columns]
df.columns = map(lambda x: x.lower(), df.columns)

Identifying Null Values

  • I identify null values by using the isnull() method
df.isnull().sum()
  • To get the total sum of null values in the dataset, we just add another sum method.
df.isnull().sum().sum()
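  • A quick extra check: since booleans average to a proportion, the mean of isnull() gives the fraction of nulls per column.
df.isnull().mean() # fraction of null values in each column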

Clean Data

Dates

  • When cleaning dates, I use the .to_datetime() method with errors='coerce'. Then I save the result to another column and inspect the rows with errors by slicing the dataframe and showing only the null values. After validating, I reassign.
df['date_ver'] = pd.to_datetime(df.date, errors='coerce')
df[['date', 'date_ver']][df.date_ver.isnull()]

# after validating or cleaning the data
df.date = df.date_ver

Numeric

df['col1_ver'] = pd.to_numeric(df.col1, errors='coerce')
df[df.col1_ver.isnull()]
# looking for a way to show only those rows with any true condition
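  • One way to do that, assuming several verification columns such as col1_ver and col2_ver, is to combine the null checks with any(axis=1).
ver_cols = ['col1_ver', 'col2_ver']       # hypothetical verification columns
df[df[ver_cols].isnull().any(axis=1)]     # rows where any verification failed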

Nulls

  • After running the isnull().sum() method combination, I run a few validation checks. I start by showing all rows, filtered by the column with the most null values, to compare. After finding patterns, I go directly to the affected rows.
# showing all rows affected
df[df.column_most_null.isnull()]

# showing targeted rows
df[['col1', 'col2', 'col3']][df.colaffected.isnull()]

# replace nan
df.fillna(0)
df.fillna(df.mean(axis=0))
df.fillna(method='ffill')
df.fillna(method='bfill')

# drop rows with nan
df.dropna(axis=0)
df.dropna(axis=0, thresh=4) # keep only rows with at least 4 non-null values

# drop columns with nan
df.dropna(axis=1)

# drop specific rows
drop_this = df[df.col1.isnull()].index
df.drop(drop_this, inplace=True)

# Reset index after dropping
df.reset_index()
df.reset_index(drop=True, inplace=True)

Columns & Rows

  • If there are extra columns or rows, I use drop for the extra columns and slicing for the extra rows.
# deleting extra columns (last 5 in this case)
df = df.drop(df.columns[-5:], axis=1)

# deleting extra rows (last 1 in this case)
df = df[:-1]

Finding Characters & Setting Conditions

  • To find characters, I use the contains() method and create conditions to query only rows that meet those conditions.
df.col1.str.contains(' ')
condition = (df.col1.str.contains(' ')) & (df.col2.isnull()) # always encapsulate in parentheses to avoid unwanted errors
df.col1[condition]

Replacing Values in Dataframe

df.loc[condition, 'col1'] = new_value

Splitting Values on Condition

# Create series of lists
df.col1[condition].apply(lambda x: x.split(' '))
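  • A vectorized alternative (a sketch using the .str accessor rather than apply) spreads the split pieces into separate columns.
split_cols = df.loc[condition, 'col1'].str.split(' ', expand=True) # one column per piece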

Replace Non-numbers

import re
df.col1.apply(lambda x: re.sub('[^0-9]', '', str(x)))

Removing Duplicates

df.drop_duplicates(subset=['col1', 'col2', 'col3'], inplace=True)

Slicing Target or Dependent Variable

  • When slicing just the target, I use series slicing: each of the first four lines returns a Series, while the dataframe-style slicing below returns a DataFrame.
# Series Slicing
df.col1
df['col1']
df.loc[:, 'col1']
df.iloc[:, 0]
# DataFrame Slicing
df[['col1']]
df.loc[:, ['col1']]
df.iloc[:, [0]]
  • After slicing the target, I take it out of the dataframe; pop() removes the column and returns it.
df.pop('col1')
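  • A minimal sketch of the full split, assuming 'col1' is the target column.
y = df.pop('col1')   # target as a Series; 'col1' is no longer in df
X = df               # remaining columns are the features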

Encoding Qualitative Features

  • To encode my target, I use the astype().cat.codes method.
df['feature_name_ec'] = df.feature_name.astype("category").cat.codes
  • When the categorical feature is ordinal in nature, I use the ordered and categories parameters.
ordered_features = ['ord1', 'ord2', 'ord3']
df['feature_name_ec'] = df.feature_name.astype("category", ordered=True, categories=ordered_features).cat.codes
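  • In newer pandas the categories and ordered arguments to astype() were removed, so as a sketch, the same ordinal encoding can be done with pd.CategoricalDtype.
ordered_features = ['ord1', 'ord2', 'ord3']
cat_type = pd.CategoricalDtype(categories=ordered_features, ordered=True)
df['feature_name_ec'] = df.feature_name.astype(cat_type).cat.codes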

Using Bag of Words

  • If I am handling text, I use the CountVectorizer class with the fit_transform(), get_feature_names(), and toarray() methods.
from sklearn.feature_extraction.text import CountVectorizer
# set the object
bag = CountVectorizer()

# get the count of unique words per document
bag_count = bag.fit_transform(words_list).toarray()

# get the unique words (available only after fitting)
bag.get_feature_names()
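  • A small end-to-end sketch with a made-up words_list, to show what the pieces return (counts are per document, columns follow the sorted vocabulary).
from sklearn.feature_extraction.text import CountVectorizer

words_list = ["the cat sat", "the dog sat", "the cat and the dog"]  # hypothetical documents
bag = CountVectorizer()
bag_count = bag.fit_transform(words_list).toarray()

print(bag.get_feature_names())  # ['and', 'cat', 'dog', 'sat', 'the'] (get_feature_names_out() in newer scikit-learn)
print(bag_count.shape)          # (3 documents, 5 unique words)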

Using Stop Words

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

sw = stopwords.words("english")      # requires nltk.download('stopwords') the first time
stemmer = SnowballStemmer("english")
stemmer.stem("word")