-
-
Save dnk8n/925db332579a539f9dea37301906fd45 to your computer and use it in GitHub Desktop.
from pathlib import Path | |
from typing import Any, Callable, Dict, Iterator, List, Optional, Union | |
from openpyxl import load_workbook | |
def xlsx_dictreader( | |
filename: Union[str, Path], | |
sheet_index: int = 0, | |
header_row_index: int = 1, | |
data_start_row_index: int = 2, | |
data_only: bool = True, | |
post_process_funcs: Optional[List[Callable[[Any], Any]]] = None, | |
null_vals: List[Union[None, str]] = [None, "None"], | |
) -> Iterator[Dict[str, Any]]: | |
book = load_workbook(filename, data_only=data_only) | |
sheet = book.worksheets[sheet_index] | |
header = [ | |
c | |
for c in (cell.value for cell in sheet[header_row_index]) | |
if c not in null_vals | |
] | |
if not post_process_funcs: | |
post_process_funcs = [lambda x: x] * len(header) | |
else: | |
if len(post_process_funcs) != len(header): | |
raise Exception( | |
"post-processing functions do not line up with headers" | |
) | |
for row_idx in range(data_start_row_index, sheet.max_row + 1): | |
candidate = {} | |
for col_idx in range(1, sheet.max_column + 1): | |
if col_idx <= len(header): | |
candidate[header[col_idx - 1]] = post_process_funcs[ | |
col_idx - 1 | |
](sheet.cell(row=row_idx, column=col_idx).value) | |
if not all(value in null_vals for value in candidate.values()): | |
yield candidate |
Forgive me, I am not as advanved in Python as you are and my VS Code is higlighting an error in this line:
for row_idx in range(data_start_row_index, sheet.max_row):
--> internal_col_idx = col_idx - 1
col_idx is not defined.
How to fix this?
Thanks for asking. I hardly remember writing this code! Let me give it an update based on my updated knowledge.
I can see a few issues with this code!
I found it on stackoverflow and would find it very useful for instantiating objects from a table in an Excel file. Thanks in advance!
@splosarek I fixed it, please see updated version.
I do note that it doesn't work for my example Excel file of format:
links | texts | date | snippets | predicted_labels | confidence | |
---|---|---|---|---|---|---|
89784 | https://www.theguardian.com/environment/2017/aug/15/bolivia-approves-highway-in-amazon-biodiversity-hotspot-as-big-as-jamaica | National park which is home to thousands of indigenous people... | 2017-08-15 | which would cut through an Amazon biodiversity hotspot almost ... | not company | 0,994122326374054 |
Each column needs a heading. So in the case that I modify my example excel file to include the text 'ID' in A1, it works as intended.
Thank you! I'll work with it and let you know in case of issues.
@splosarek I found and fixed another issue that caused the last dict item to be dropped.
Yeah, I found it too and applied exactly the same fix :)
Works beautifully! Thank you so much!
Glad it can be of use.
My advice is to build up some unit tests for your use case. Using code in the wild always comes with a bit of risk, so it is useful to guarantee so far as you can, that it is working well for your use case.
Don't just run and hope, as the initial author might have had another set of scenarios and not thought about the edge cases present in yours!
Thanks for the alert and for giving me an opportunity to see where I went wrong unknowingly :)
I also found that openpyxl would incorporate values of stringed 'None' sometimes, so I added an option to define null_vals. This is only used to omit rows of your data that have every value set to either of this list... so should be pretty safe to use as it is unlikely to be a very useful row of data!