-
-
Save dnk8n/925db332579a539f9dea37301906fd45 to your computer and use it in GitHub Desktop.
from pathlib import Path | |
from typing import Any, Callable, Dict, Iterator, List, Optional, Union | |
from openpyxl import load_workbook | |
def xlsx_dictreader( | |
filename: Union[str, Path], | |
sheet_index: int = 0, | |
header_row_index: int = 1, | |
data_start_row_index: int = 2, | |
data_only: bool = True, | |
post_process_funcs: Optional[List[Callable[[Any], Any]]] = None, | |
null_vals: List[Union[None, str]] = [None, "None"], | |
) -> Iterator[Dict[str, Any]]: | |
book = load_workbook(filename, data_only=data_only) | |
sheet = book.worksheets[sheet_index] | |
header = [ | |
c | |
for c in (cell.value for cell in sheet[header_row_index]) | |
if c not in null_vals | |
] | |
if not post_process_funcs: | |
post_process_funcs = [lambda x: x] * len(header) | |
else: | |
if len(post_process_funcs) != len(header): | |
raise Exception( | |
"post-processing functions do not line up with headers" | |
) | |
for row_idx in range(data_start_row_index, sheet.max_row + 1): | |
candidate = {} | |
for col_idx in range(1, sheet.max_column + 1): | |
if col_idx <= len(header): | |
candidate[header[col_idx - 1]] = post_process_funcs[ | |
col_idx - 1 | |
](sheet.cell(row=row_idx, column=col_idx).value) | |
if not all(value in null_vals for value in candidate.values()): | |
yield candidate |
You can also pass in a path object, e.g.
from pathlib import Path
excelpath = Path('/path/to/excel.file.xlsx')
excel_dicts = xlsx_dictreader(excelpath)
for excel_dict in excel_dicts:
print(excel_dict)
I found that I often needed a different processor per column,
e.g.:
- setting values to ints instead of floats
- treating 1, y, yes, 'True', true -> True
- any other function
For this, it is possible to do the following (these functions are absolutely not fit for production, they are very brittle and only work if you can be sure the data fits a specific format):
def int_or_none(x):
# Doesn't work for 0. An example of why not to use these functions... they are for illustrative purposes only
if x:
return int(x)
else:
return None
def true_or_false(x):
if str(x).lower() in ['1', 'true', 'yes']:
return True
elif str(x).lower() in ['0', 'false', 'no']:
return False
else:
return x
post_process_funcs=[
str,
lambda x: int(x) if x else None,
true_or_false
int_or_none
]
exceldicts = xlsx_dictreader(
'/path/to/excel.file.xlsx',
post_process_funcs=post_process_funcs
)
for excel_dict in excel_dicts:
print(excel_dict)
I also found that openpyxl would incorporate values of stringed 'None' sometimes, so I added an option to define null_vals. This is only used to omit rows of your data that have every value set to either of this list... so should be pretty safe to use as it is unlikely to be a very useful row of data!
Forgive me, I am not as advanved in Python as you are and my VS Code is higlighting an error in this line:
for row_idx in range(data_start_row_index, sheet.max_row):
--> internal_col_idx = col_idx - 1
col_idx is not defined.
How to fix this?
Thanks for asking. I hardly remember writing this code! Let me give it an update based on my updated knowledge.
I can see a few issues with this code!
I found it on stackoverflow and would find it very useful for instantiating objects from a table in an Excel file. Thanks in advance!
@splosarek I fixed it, please see updated version.
I do note that it doesn't work for my example Excel file of format:
links | texts | date | snippets | predicted_labels | confidence | |
---|---|---|---|---|---|---|
89784 | https://www.theguardian.com/environment/2017/aug/15/bolivia-approves-highway-in-amazon-biodiversity-hotspot-as-big-as-jamaica | National park which is home to thousands of indigenous people... | 2017-08-15 | which would cut through an Amazon biodiversity hotspot almost ... | not company | 0,994122326374054 |
Each column needs a heading. So in the case that I modify my example excel file to include the text 'ID' in A1, it works as intended.
Thank you! I'll work with it and let you know in case of issues.
@splosarek I found and fixed another issue that caused the last dict item to be dropped.
Yeah, I found it too and applied exactly the same fix :)
Works beautifully! Thank you so much!
Glad it can be of use.
My advice is to build up some unit tests for your use case. Using code in the wild always comes with a bit of risk, so it is useful to guarantee so far as you can, that it is working well for your use case.
Don't just run and hope, as the initial author might have had another set of scenarios and not thought about the edge cases present in yours!
Thanks for the alert and for giving me an opportunity to see where I went wrong unknowingly :)
Example simple usage: