Here I will show the techniques I use to manipulate data using Python. These techniques might change if I find a more efficient way of performing a task.
import os
os.getcwd() # gets the current working directory
os.listdir() # lists the files and folders in the current path
os.listdir("folder_name") # lists the files and folders in the stated folder
Available Methods
Sometimes we would like to see the available methods, either to explore or because we forgot the correct spelling.
df. # Press the tab key to see all available methods for the dataframe object
df.co # Pressing tab on this one will only show methods starting with "co"
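Tab completion is an IPython/Jupyter feature; in a plain Python shell, a rough equivalent (my addition, not in the original notes) is dir() with a comprehension:
# list dataframe attributes starting with "co"
[m for m in dir(df) if m.startswith('co')]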
Renaming Columns & Lowering Characters
I explore the available features. In case I need to change names, I rename them and then convert all metadata to lowercase.
# show the columns
df.columns
# replace all column names positionally
df.columns = ['new', 'name', 'for', 'columns']
# rename specific columns (rename returns a new frame, so assign it back)
df = df.rename(columns={"col1": "colnew1", "col2": "colnew2"})
# convert all column names to lowercase
df.columns = df.columns.str.lower()
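A minimal end-to-end sketch of these steps, using a hypothetical two-column frame:
import pandas as pd
df = pd.DataFrame({'First Name': ['Ana', 'Bo'], 'AGE': [30, 25]})
df = df.rename(columns={'First Name': 'first_name'})  # rename one column
df.columns = df.columns.str.lower()                   # lowercase everything
df.columns.tolist()                                   # ['first_name', 'age']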
Show all Data Types
Then I check the data types loaded.
df.dtypes
Then I preview the first or last rows of the frame.
df.head(10)
df.tail(10)
Then I check the descriptive statistics of quantitative values to have a sense of the distribution.
df.describe()
df.mean(axis=0) # column means (newer pandas may need numeric_only=True)
df.col1.value_counts() # frequency of each unique value in a column
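For example, on a made-up frame:
df = pd.DataFrame({'price': [10, 10, 20], 'city': ['NY', 'NY', 'LA']})
df.describe()          # count, mean, std, min, quartiles, max of numeric columns
df.city.value_counts() # NY 2, LA 1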
Lower All Metadata
Two more ways to lowercase all column names, equivalent to the str.lower() approach above:
df.columns = [x.lower() for x in df.columns]
df.columns = map(lambda x: x.lower(), df.columns)
Identifying Null Values
I identify null values using the isnull() method.
df.isnull().sum()
To get the total sum of null values in the dataset, we just add another sum method.
df.isnull().sum().sum()
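A toy example, assuming a frame with one missing value:
import numpy as np
df = pd.DataFrame({'a': [1, np.nan], 'b': [2, 3]})
df.isnull().sum()       # per column: a 1, b 0
df.isnull().sum().sum() # total: 1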
Clean Data
Dates
When cleaning dates, I use the .to_datetime() method with the errors='coerce' parameter. Then I save the result to another column so I can inspect the rows with errors, by slicing into a dataframe and showing only the null values. After validating, I reassign.
df['date_ver'] = pd.to_datetime(df.date, errors='coerce')
df[['date', 'date_ver']][df.date_ver.isnull()]
# after validating or cleaning the data
df.date = df.date_ver
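A self-contained version of the same pattern, with a deliberately bad value in made-up data:
df = pd.DataFrame({'date': ['2020-01-05', 'not a date']})
df['date_ver'] = pd.to_datetime(df.date, errors='coerce') # invalid strings become NaT
df[['date', 'date_ver']][df.date_ver.isnull()]            # surfaces the bad row
df.date = df.date_ver                                     # reassign once validated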
Numeric
df['col1_ver'] = pd.to_numeric(df.col1, errors='coerce')
df[df.col1_ver.isnull()]
# looking for a way to show only those rows with any true condition
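One way to do this (my suggestion, not in the original notes) is to reduce a boolean frame with any(axis=1); col1 and col2 here are hypothetical:
df[df.isnull().any(axis=1)]                # rows with at least one null
df[(df[['col1', 'col2']] < 0).any(axis=1)] # rows where any listed condition holds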
Nulls
After running the isnull().sum() method combination, I run a series of validation checks. I start by showing all rows, filtered by the column with the most null values, to compare. After finding patterns, I go directly to the affected rows.
# showing all rows affected
df[df.column_most_null.isnull()]
# showing targeted rows
df[['col1', 'col2', 'col3']][df.colaffected.isnull()]
# replace nan
df.fillna(0)               # with a constant
df.fillna(df.mean(axis=0)) # with column means
df.fillna(method='ffill')  # forward-fill (newer pandas: df.ffill())
df.fillna(method='bfill')  # back-fill (newer pandas: df.bfill())
# drop rows with nan
df.dropna(axis=0)
df.dropna(axis=0, thresh=4) # keep only rows with at least 4 non-null values
# drop columns with nan
df.dropna(axis=1)
# drop specific rows
drop_this = df[df.col1.isnull()].index
df.drop(drop_this, inplace=True)
# Reset index after dropping
df.reset_index()                        # returns a new frame
df.reset_index(drop=True, inplace=True) # resets in place, discarding the old index
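Putting the drop-and-reset pattern together on a toy frame:
df = pd.DataFrame({'col1': [1, np.nan, 3], 'col2': ['a', 'b', 'c']})
drop_this = df[df.col1.isnull()].index  # index labels of rows where col1 is null
df.drop(drop_this, inplace=True)
df.reset_index(drop=True, inplace=True) # index is 0, 1 again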
Columns & Rows
In case there are extra columns or rows, I use drop for the extra columns and slicing for the extra rows.
# deleting extra columns (last 5 in this case)
df = df.drop(df.columns[-5:], axis=1)
# deleting extra rows (last 1 in this case)
df = df[:-1]
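The same trim can also be written as a single iloc slice, which is a matter of preference:
# all but the last row, all but the last 5 columns, in one step
df = df.iloc[:-1, :-5]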
Finding Characters & Setting Conditions
To find characters, I use the str.contains() method and create conditions to query only rows that meet them.
df.col1.str.contains(' ')
condition = (df.col1.str.contains(' ')) & (df.col2.isnull()) # always encapsulate in parentheses to avoid unwanted errors
df.col1[condition]
Replacing Values in Dataframe
df.loc[condition, 'col1'] = new_value
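For example, with a hypothetical frame and the condition pattern from above:
df = pd.DataFrame({'col1': ['a b', 'c'], 'col2': [np.nan, 1]})
condition = (df.col1.str.contains(' ')) & (df.col2.isnull())
df.loc[condition, 'col1'] = 'fixed' # only the first row matches and is replaced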
Splitting Values on Condition
# Create series of lists
df.col1[condition].apply(lambda x: x.split(' '))
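The .str accessor offers a vectorized equivalent that avoids apply:
df.col1[condition].str.split(' ') # same series of lists, vectorized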
If I am handling text, I use the CountVectorizer class with the fit_transform(), toarray(), and get_feature_names_out() methods (get_feature_names() in older scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
# set the object
bag = CountVectorizer()
# get the count of unique words per document (fit must come before asking for the vocabulary)
bag_count = bag.fit_transform(words_list).toarray()
# get unique words (use get_feature_names() on scikit-learn older than 1.2)
bag.get_feature_names_out()
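A runnable sketch with made-up documents:
from sklearn.feature_extraction.text import CountVectorizer
words_list = ['the cat sat', 'the cat ran']
bag = CountVectorizer()
bag_count = bag.fit_transform(words_list).toarray()
bag.get_feature_names_out() # ['cat' 'ran' 'sat' 'the']
bag_count                   # [[1 0 1 1], [1 1 0 1]]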
Using Stop Words & Stemming
from nltk.corpus import stopwords
sw = stopwords.words("english") # common English words to filter out
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
stemmer.stem("word") # reduces a word to its stem
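Combining the two on a made-up token list (the NLTK stopword corpus must be downloaded once with nltk.download('stopwords')):
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
sw = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
tokens = ['the', 'running', 'cats']
[stemmer.stem(t) for t in tokens if t not in sw] # ['run', 'cat']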