@Mlawrence95
Mlawrence95 / md5_decorator.py
Last active May 20, 2020 22:31
A Python decorator that adds a column to your pandas dataframe: the MD5 hash of the specified column
import pandas as pd
from hashlib import md5


def text_to_hash(text):
    # Hex-encoded MD5 digest of the UTF-8 bytes of `text`
    return md5(text.encode("utf8")).hexdigest()


def add_hash(column_name="document"):
    """
    Decorator. Wraps a function that returns a dataframe; the dataframe must have column_name in its columns.
    """
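The preview above cuts off inside the docstring, so the body of `add_hash` is not shown. As a rough sketch of how a decorator matching that description could look (not the author's actual code), the version below assumes the wrapped function returns a pandas dataframe. The `hash_column` parameter and the inner names `decorator` and `wrapper` are hypothetical, added for illustration only.

```python
from functools import wraps
from hashlib import md5


def text_to_hash(text):
    # Hex-encoded MD5 digest of the UTF-8 bytes of `text`
    return md5(text.encode("utf8")).hexdigest()


def add_hash(column_name="document", hash_column="hash"):
    """
    Decorator factory. `hash_column` is a hypothetical parameter naming the
    new column; the original gist may choose the name differently.
    """
    def decorator(func):
        @wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            df = func(*args, **kwargs)
            if column_name not in df.columns:
                raise KeyError(f"{column_name!r} not found in dataframe columns")
            # Work on a copy so the caller's original dataframe is not mutated
            df = df.copy()
            df[hash_column] = df[column_name].map(text_to_hash)
            return df
        return wrapper
    return decorator
```

Under these assumptions, decorating a loader with `@add_hash("document")` returns the same dataframe with one extra column of hex digests. Values in the hashed column must be strings, since `text_to_hash` calls `.encode` on each one.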
@Mlawrence95
Mlawrence95 / read_csv_from_aws_s3_targz.python
Created July 27, 2020 22:54
Given a CSV file that's inside a tar.gz file on AWS S3, read it into a Pandas dataframe without downloading or extracting the entire tar file
# checked against python 3.7.3, pandas 0.24.2, s3fs 0.4.2
import tarfile
import io
import s3fs
import pandas as pd
tar_path = "s3://my-bucket/debug.tar.gz"  # path in S3
metadata_path = "debug/metadata.csv"  # path inside the tar file
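The preview stops after the two path variables. One possible way to finish the job is sketched below, assuming the archive is read as a forward-only gzip stream so that reading can stop once the wanted member is found. The function names `extract_member` and `read_csv_from_s3_targz` are hypothetical and do not come from the gist. Note that a gzip stream cannot be seeked, so every byte ahead of the member still gets read.

```python
import io
import tarfile


def extract_member(fileobj, member_path):
    """Stream through a gzip-compressed tar and return one member's bytes."""
    # "r|gz" treats the archive as a forward-only stream (no seeking needed)
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.name == member_path:
                extracted = tar.extractfile(member)
                if extracted is None:  # directories and links have no data
                    raise IsADirectoryError(member_path)
                return extracted.read()
    raise FileNotFoundError(member_path)


def read_csv_from_s3_targz(tar_path, member_path, **read_csv_kwargs):
    # Imported lazily so extract_member works without these packages installed
    import pandas as pd
    import s3fs

    fs = s3fs.S3FileSystem()  # uses the default AWS credential chain
    with fs.open(tar_path, "rb") as f:
        data = extract_member(f, member_path)
    return pd.read_csv(io.BytesIO(data), **read_csv_kwargs)
```

With the variables above, the call would be `df = read_csv_from_s3_targz(tar_path, metadata_path)`. Member names must match exactly what the archive stores, so an archive created with a leading `./` would need `"./debug/metadata.csv"` instead.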