Alexander Kislukhin liquidcarbon
liquidcarbon / value_counts_mapreduce.py
Created July 1, 2020 15:31
Map-Reduce implementation of COUNT GROUP BY on each column of a large dataframe
from functools import reduce
import numpy as np
import pandas as pd

class Counts:
    """COUNT ... GROUP BY on every column of a large dataset"""
    def __init__(self, file, ddl_file, n_cols=None, n_top=10):
        self.file = file
        self.columns = get_columns_from_ddl(ddl_file)
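The preview above is cut off before the counting logic, so here is a minimal sketch of how a map-reduce COUNT ... GROUP BY over every column might look. It is not the gist's actual implementation. It assumes the data is a delimited text file readable by `pandas.read_csv` in chunks. The names `map_chunk`, `reduce_counts`, `value_counts_mapreduce`, and the `chunksize` parameter are hypothetical. Only `n_top` is borrowed from the signature above.

```python
from functools import reduce
import io

import pandas as pd


def map_chunk(chunk):
    """Map step: count the values of every column within one chunk."""
    return {col: chunk[col].value_counts() for col in chunk.columns}


def reduce_counts(left, right):
    """Reduce step: add per-column counts; a value missing on one side counts as 0."""
    return {
        col: left[col].add(right[col], fill_value=0).astype(int)
        for col in left
    }


def value_counts_mapreduce(file, chunksize=100_000, n_top=10, **read_csv_kwargs):
    """Return the n_top most frequent values of each column (hypothetical helper)."""
    # Reading in chunks keeps memory bounded by the chunk size, not the file size.
    chunks = pd.read_csv(file, chunksize=chunksize, **read_csv_kwargs)
    totals = reduce(reduce_counts, map(map_chunk, chunks))
    return {
        col: counts.sort_values(ascending=False).head(n_top)
        for col, counts in totals.items()
    }


# Small in-memory demo: two chunks of two rows each.
demo = io.StringIO("a,b\nx,1\ny,1\nx,2\nx,1\n")
top = value_counts_mapreduce(demo, chunksize=2)
print(top["a"])
```

Because the per-chunk counts are plain Series, the reduce step is just an index-aligned addition, so only one chunk plus the running totals are in memory at a time. Missing values are dropped by `value_counts` by default, and a file with no data rows would need separate handling.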
liquidcarbon / fix_linebreaks.py
Created June 29, 2020 23:11
Remove newline and carriage return characters to fix broken lines in a large data export
import sys

def fix(file, sep, nf, output):
    """Checks and fixes prematurely terminated lines in a tabular file.

    :param file: input file
    :param sep: delimiter or its ASCII **octal** code
    :param nf: expected number of fields
    :param output: output file
    :return: None
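The body of `fix` is not shown in this preview. As an illustration of the idea in the docstring, the sketch below joins physical lines until a record has the expected number of fields. It is one possible approach, not the gist's actual code. The helpers `parse_sep` and `fix_lines` are hypothetical, and reading `sep` as an octal code whenever it consists of several octal digits is an assumption.

```python
def parse_sep(sep):
    """Return the delimiter character.

    Assumption: a multi-character string of octal digits (e.g. "174")
    is an ASCII octal code; anything else is used literally.
    """
    if len(sep) > 1 and all(c in "01234567" for c in sep):
        return chr(int(sep, 8))
    return sep


def fix_lines(lines, sep, nf):
    """Yield records with nf fields, re-joining lines broken by stray CR/LF."""
    delim = parse_sep(sep)
    buffer = ""
    for line in lines:
        # Drop the line-break characters and append to the pending record.
        buffer += line.replace("\r", "").replace("\n", "")
        # A record with nf fields contains nf - 1 delimiters.
        if buffer.count(delim) + 1 >= nf:
            yield buffer
            buffer = ""
    if buffer:
        # Trailing incomplete record: passed through unchanged.
        yield buffer


broken = ["a|b|c\n", "1|2\n", "|3\n", "4|5|6\r\n"]
for record in fix_lines(broken, "|", 3):
    print(record)
```

This sketch has one known limitation. A line break inside the last field cannot be detected by counting delimiters, because the record already has `nf` fields when the break occurs.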