@datlife
Last active March 14, 2018 19:41
Comparison: how to iterate efficiently over pandas DataFrame rows.
"""
Problem:
--------
We would like to find out which method performs row iteration most efficiently.
The test creates a DataFrame with 3 columns and 100,000 rows.
Results:
--------
vector: Iterated over 100000 rows in 0.029180 | Sample at idx [0]: (1, 100000)
zip: Iterated over 100000 rows in 0.073447 | Sample at idx [0]: [1, 100000]
itertuple: Iterated over 100000 rows in 0.130544 | Sample at idx [0]: [1, 100000]
to_dict: Iterated over 100000 rows in 1.149689 | Sample at idx [0]: [1, 100000]
iterrows: Iterated over 100000 rows in 4.680544 | Sample at idx [0]: [1, 100000]
Conclusion: Avoid iterrows(); prefer the vector or zip approach for performance.
"""
import time
import pandas as pd
# Build a DataFrame with 3 columns and 100,000 rows.
df = pd.DataFrame({'col1': range(0, 100000),
                   'col2': range(100000, 200000),
                   'col3': range(200000, 300000)})
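# "vector": Series.apply on col1, zipped with col2. list() is needed on
# Python 3, where zip() returns an iterator that cannot be indexed.
s = time.time()
result = list(zip(df['col1'].apply(lambda i: i + 1), df['col2']))
print('vector: Iterated over {} rows in {:.6f}s | Sample at idx [0]: {}'.format(
    len(df), time.time() - s, result[0]))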
# "zip": iterate over the two columns in lockstep.
s = time.time()
result = [[a + 1, b] for a, b in zip(df['col1'], df['col2'])]
print('zip: Iterated over {} rows in {:.6f}s | Sample at idx [0]: {}'.format(
    len(df), time.time() - s, result[0]))
# "itertuple": each row is yielded as a namedtuple.
s = time.time()
result = [[ir.col1 + 1, ir.col2] for ir in df.itertuples()]
print('itertuple: Iterated over {} rows in {:.6f}s | Sample at idx [0]: {}'.format(
    len(df), time.time() - s, result[0]))
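# "to_dict": convert to a list of per-row dicts first. The orient must be
# spelled 'records'; recent pandas versions reject the abbreviation 'record'.
s = time.time()
result = [[x['col1'] + 1, x['col2']] for x in df.to_dict('records')]
print('to_dict: Iterated over {} rows in {:.6f}s | Sample at idx [0]: {}'.format(
    len(df), time.time() - s, result[0]))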
# "iterrows": each row is yielded as an (index, Series) pair.
s = time.time()
result = [[x['col1'] + 1, x['col2']] for _, x in df.iterrows()]
print('iterrows: Iterated over {} rows in {:.6f}s | Sample at idx [0]: {}'.format(
    len(df), time.time() - s, result[0]))
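
The "vector" case above still calls a Python lambda once per element through Series.apply. As a rough sketch only, and not part of the original benchmark, the same rows can also be built with plain column arithmetic (`df['col1'] + 1`). The helper `time_call` and the smaller frame `small` are hypothetical names introduced here for illustration; no timings are claimed for this variant.

```python
import time

import pandas as pd


def time_call(label, fn):
    """Run fn once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    out = fn()
    elapsed = time.perf_counter() - start
    print('{}: {} rows in {:.6f}s'.format(label, len(out), elapsed))
    return out


# Hypothetical smaller frame so the sketch runs quickly.
small = pd.DataFrame({'col1': range(0, 1000), 'col2': range(1000, 2000)})

# Series.apply calls the lambda once per element in Python.
applied = time_call('apply', lambda: list(
    zip(small['col1'].apply(lambda i: i + 1), small['col2'])))

# Column arithmetic adds 1 to the whole column in one operation.
vectorized = time_call('col + 1', lambda: list(
    zip((small['col1'] + 1).tolist(), small['col2'].tolist())))
```

Both calls produce the same rows, so the column-arithmetic form can be dropped into the benchmark as an extra case and timed on the full 100,000-row frame.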