Last active
December 28, 2018 06:22
Convert multiple CSV files split by date into monthly Parquet files
# This script compacts daily CSV files into one Parquet file per month.
# The CSV files must be named in the "YYYY-MM-DD.csv" format.
#
from glob import glob

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# configure this parameter
year = 2017

for month in range(1, 13):
    # Collect this month's daily files (e.g. "2017-01-*.csv") in date order.
    filenames = sorted(glob("{}-{:02}-*.csv".format(year, month)))
    if not filenames:
        # Skip months with no input instead of writing an empty Parquet file.
        continue
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    df = pd.concat((pd.read_csv(f) for f in filenames), ignore_index=True)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "{}-{:02}.parq".format(year, month), compression="gzip")
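The month selection depends entirely on the "YYYY-MM-DD.csv" naming convention. As a minimal sketch of that step in isolation, the hypothetical helper `group_by_month` (not part of the original script) buckets file names by year and month using only the standard library and ignores names that do not follow the convention.

```python
import re
from collections import defaultdict

# Matches names such as "2017-01-31.csv"; anything else is ignored.
DAILY_CSV = re.compile(r"^(\d{4})-(\d{2})-(\d{2})\.csv$")


def group_by_month(filenames):
    """Return {"YYYY-MM": [file names sorted by date]} (hypothetical helper)."""
    groups = defaultdict(list)
    for name in filenames:
        match = DAILY_CSV.match(name)
        if match is None:
            continue  # not a daily CSV, so skip it
        year, month, _day = match.groups()
        groups["{}-{}".format(year, month)].append(name)
    # Zero-padded dates sort chronologically as plain strings.
    return {key: sorted(names) for key, names in groups.items()}
```

Each value in the returned dictionary could then be passed to `pd.read_csv` file by file and written out as one Parquet file named after its key.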