Skip to content

Instantly share code, notes, and snippets.

@dnk8n
Last active April 6, 2022 14:50
Show Gist options
  • Save dnk8n/925db332579a539f9dea37301906fd45 to your computer and use it in GitHub Desktop.
Save dnk8n/925db332579a539f9dea37301906fd45 to your computer and use it in GitHub Desktop.
Use openpyxl to return a dictreader
from pathlib import Path
from typing import Any, Callable, Dict, Iterator, List, Optional, Union
from openpyxl import load_workbook
def xlsx_dictreader(
filename: Union[str, Path],
sheet_index: int = 0,
header_row_index: int = 1,
data_start_row_index: int = 2,
data_only: bool = True,
post_process_funcs: Optional[List[Callable[[Any], Any]]] = None,
null_vals: List[Union[None, str]] = [None, "None"],
) -> Iterator[Dict[str, Any]]:
book = load_workbook(filename, data_only=data_only)
sheet = book.worksheets[sheet_index]
header = [
c
for c in (cell.value for cell in sheet[header_row_index])
if c not in null_vals
]
if not post_process_funcs:
post_process_funcs = [lambda x: x] * len(header)
else:
if len(post_process_funcs) != len(header):
raise Exception(
"post-processing functions do not line up with headers"
)
for row_idx in range(data_start_row_index, sheet.max_row + 1):
candidate = {}
for col_idx in range(1, sheet.max_column + 1):
if col_idx <= len(header):
candidate[header[col_idx - 1]] = post_process_funcs[
col_idx - 1
](sheet.cell(row=row_idx, column=col_idx).value)
if not all(value in null_vals for value in candidate.values()):
yield candidate
@dnk8n
Copy link
Author

dnk8n commented Sep 28, 2021

Example simple usage:

excel_dicts = xlsx_dictreader('/path/to/excel.file.xlsx')
for excel_dict in excel_dicts:
    print(excel_dict)

@dnk8n
Copy link
Author

dnk8n commented Sep 28, 2021

You can also pass in a path object, e.g.

from pathlib import Path
excelpath = Path('/path/to/excel.file.xlsx')
excel_dicts = xlsx_dictreader(excelpath)
for excel_dict in excel_dicts:
    print(excel_dict)

@dnk8n
Copy link
Author

dnk8n commented Sep 28, 2021

I found that I often needed a different processor per column,
e.g.:

  • setting values to ints instead of floats
  • treating 1, y, yes, 'True', true -> True
  • any other function

For this, it is possible to do the following (these functions are absolutely not fit for production, they are very brittle and only work if you can be sure the data fits a specific format):

def int_or_none(x):
    # Doesn't work for 0. An example of why not to use these functions... they are for illustrative purposes only
    if x:
        return int(x)
    else:
        return None

def true_or_false(x):
    if str(x).lower() in ['1', 'true', 'yes']:
        return True
    elif str(x).lower() in ['0', 'false', 'no']:
        return False
    else:
        return x

post_process_funcs=[
    str,
    lambda x: int(x) if x else None,
    true_or_false
    int_or_none
]

exceldicts = xlsx_dictreader(
    '/path/to/excel.file.xlsx',
    post_process_funcs=post_process_funcs   
)
for excel_dict in excel_dicts:
    print(excel_dict)

@dnk8n
Copy link
Author

dnk8n commented Sep 28, 2021

I also found that openpyxl would incorporate values of stringed 'None' sometimes, so I added an option to define null_vals. This is only used to omit rows of your data that have every value set to either of this list... so should be pretty safe to use as it is unlikely to be a very useful row of data!

@splosarek
Copy link

Forgive me, I am not as advanved in Python as you are and my VS Code is higlighting an error in this line:

    for row_idx in range(data_start_row_index, sheet.max_row):

--> internal_col_idx = col_idx - 1

col_idx is not defined.
How to fix this?

@dnk8n
Copy link
Author

dnk8n commented Feb 2, 2022

Thanks for asking. I hardly remember writing this code! Let me give it an update based on my updated knowledge.

I can see a few issues with this code!

@splosarek
Copy link

I found it on stackoverflow and would find it very useful for instantiating objects from a table in an Excel file. Thanks in advance!

@dnk8n
Copy link
Author

dnk8n commented Feb 2, 2022

@splosarek I fixed it, please see updated version.

I do note that it doesn't work for my example Excel file of format:

  links texts date snippets predicted_labels confidence
89784 https://www.theguardian.com/environment/2017/aug/15/bolivia-approves-highway-in-amazon-biodiversity-hotspot-as-big-as-jamaica National park which is home to thousands of indigenous people... 2017-08-15 which would cut through an Amazon biodiversity hotspot almost ... not company 0,994122326374054

Each column needs a heading. So in the case that I modify my example excel file to include the text 'ID' in A1, it works as intended.

@splosarek
Copy link

Thank you! I'll work with it and let you know in case of issues.

@dnk8n
Copy link
Author

dnk8n commented Feb 2, 2022

@splosarek I found and fixed another issue that caused the last dict item to be dropped.

@splosarek
Copy link

Yeah, I found it too and applied exactly the same fix :)
Works beautifully! Thank you so much!

@dnk8n
Copy link
Author

dnk8n commented Feb 2, 2022

Glad it can be of use.

My advice is to build up some unit tests for your use case. Using code in the wild always comes with a bit of risk, so it is useful to guarantee so far as you can, that it is working well for your use case.

Don't just run and hope, as the initial author might have had another set of scenarios and not thought about the edge cases present in yours!

Thanks for the alert and for giving me an opportunity to see where I went wrong unknowingly :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment