@zacharysyoung
Last active March 4, 2023 20:36

Answering SO-75608149

For CSV to dataclass...

I originally had this logic to check whether a row of the CSV contained any blank values:

n_cols = len(rows[0])
for row in rows:
    if len([x for x in row if x]) != n_cols:
        continue
    pass

But then I realized that the list comp has to iterate every field, even if the row has a blank in the first field.

So, I thought, "would a for-loop with a simple membership test, which can stop at the first blank, be faster?":

for row in rows:
    if "" in row:
        continue
    pass
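
To make the difference in work per row concrete, here is a small sketch that counts how many fields each approach actually looks at. The `counted` generator and the two `fields_examined_*` helpers are hypothetical names introduced only for this illustration; they are not part of the benchmark. Wrapping the row in a generator is just a way to observe the iteration. With a real list, the `in` test also does its scan in C, which likely adds to the speedup.

```python
from collections.abc import Iterator


def counted(row: list[str], counter: list[int]) -> Iterator[str]:
    """Yield each field of row, bumping counter[0] for every field handed out."""
    for field in row:
        counter[0] += 1
        yield field


def fields_examined_list_comp(row: list[str]) -> int:
    counter = [0]
    # The comprehension has to consume the whole row before len() can run.
    _ = len([x for x in counted(row, counter) if x])
    return counter[0]


def fields_examined_membership(row: list[str]) -> int:
    counter = [0]
    # `in` stops consuming the iterator as soon as it finds a match.
    _ = "" in counted(row, counter)
    return counter[0]


beg_row = [""] + ["a"] * 10  # blank in the first field
end_row = ["a"] * 10 + [""]  # blank in the last field

print(fields_examined_list_comp(beg_row))   # 11
print(fields_examined_membership(beg_row))  # 1
print(fields_examined_membership(end_row))  # 11
```

The list-comp always touches all 11 fields, while the membership test touches only as many as it needs to find the first blank.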

The for-loop easily wins over the list-comp:

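| Type      | Blank placement | Time (s) for 30M rows |
|-----------|-----------------|-----------------------|
| for_loop  | BEG             | 0.32                  |
| for_loop  | MID             | 1.04                  |
| for_loop  | END             | 1.79                  |
| list_comp | BEG             | 6.62                  |
| list_comp | MID             | 6.63                  |
| list_comp | END             | 6.65                  |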

The original solution also called .strip() on every field, since I wasn't sure whether OP's input CSV would have a row like a,   ,b,c, ,e (with multiple spaces between a and b). Then I remembered that csv.reader has the skipinitialspace= param: when set to True, the reader ignores spaces immediately following the delimiter, which reduces fields that are all spaces to "". This removes the need to call .strip() afterwards, saving a costly method call in the tight loop.
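
As a minimal sketch of that behavior, the snippet below reads a made-up two-row sample (the `sample` string is an assumption for illustration, not OP's data) and then filters out rows with blanks using the same membership test as above.

```python
import csv
import io

# Hypothetical sample input: the first row has blank fields made of one or more spaces.
sample = "a,   ,b,c, ,e\nf,g,h,i,j,k\n"

# skipinitialspace=True makes the reader ignore spaces right after each delimiter,
# so a field consisting only of spaces comes back as "".
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
print(rows[0])  # ['a', '', 'b', 'c', '', 'e']

# With blanks normalized to "", the cheap membership test is enough.
complete = [row for row in rows if "" not in row]
print(complete)  # [['f', 'g', 'h', 'i', 'j', 'k']]

# Trailing spaces are not touched: only spaces after the delimiter are skipped.
trailing = list(csv.reader(io.StringIO("x ,y\n"), skipinitialspace=True))
print(trailing)  # [['x ', 'y']]
```

Note the last case: a field with trailing spaces before the delimiter keeps them, so this only replaces .strip() when the concern is all-space or leading-space fields.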

N_ROWS = 30_000_000

# Each dataset repeats one 11-field row; only the position of the blank differs.
# Note: list multiplication reuses the same inner list, which is fine for read-only timing.
BEG = [[""] + ["a"] * 10] * N_ROWS
MID = [["a"] * 5 + [""] + ["a"] * 5] * N_ROWS
END = [["a"] * 10 + [""]] * N_ROWS


def test_list_comp(rows: list[list[str]]):
    # Builds a filtered list for every row, so every field is always visited.
    n_cols = len(rows[0])
    for row in rows:
        if len([x for x in row if x]) != n_cols:
            continue
        pass


def test_list_comp_strip(rows: list[list[str]]):
    # Same as test_list_comp, but strips each field first.
    n_cols = len(rows[0])
    for row in rows:
        if len([x for x in row if x.strip()]) != n_cols:
            continue
        pass


def test_for_loop(rows: list[list[str]]):
    # The membership test stops at the first blank field.
    for row in rows:
        if "" in row:
            continue
        pass


def test_for_loop_strip(rows: list[list[str]]):
    def _test(row: list[str]) -> bool:
        # Return True as soon as a field is blank after stripping.
        for x in row:
            if x.strip() == "":
                return True
        return False

    for row in rows:
        if _test(row):
            continue
        pass


if __name__ == "__main__":
    import timeit

    # Time every combination of approach, strip variant, and blank placement once.
    for func in ["test_for_loop", "test_list_comp"]:
        for strip in ["", "_strip"]:
            for dset in ["BEG", "MID", "END"]:
                fname = func + strip
                times = timeit.repeat(
                    f"{fname}({dset})",
                    setup=f"from __main__ import {fname}, {dset}",
                    repeat=1,
                    number=1,
                )
                min_time = min(times)
                print(f"{fname}_{dset}: {min_time:.3g}")