@jamesthomson
Created May 21, 2015 15:20
Importing a Million Song Dataset file and converting it to a DataFrame
import pandas as pd
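# Open the tab-separated file, split each line into fields, then convert to a DataFrame.
# The with-block makes sure the file handle is closed after reading.
with open("P:\\A.tsv.a.txt", "r") as f:
    lines = [line.strip().split("\t") for line in f]
df = pd.DataFrame(lines)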
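# Pull out the columns that need a further split.
# list(...) is needed because range objects cannot be concatenated with + in Python 3.
cols = list(range(18, 22)) + list(range(33, 42))
arrays = df.loc[1:5, cols].values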
# Split all selected columns at once - doesn't work, too many levels of nesting
#out=[[item.split(",") for item in array] for array in arrays]
#df2=pd.DataFrame(out)
# Split one column at a time
col = df[20]
#print(col)
res = [item.split(",") for item in col]
df2 = pd.DataFrame(res)
# The problem is the inconsistent number of points stored in each item
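
One possible way around the inconsistent number of points per item is sketched below. It is not part of the original gist: the sample `col` Series is a made-up stand-in for `df[20]`, and the names `wide` and `long` are hypothetical. The sketch assumes every cell is a comma-separated string of numbers and shows two layouts: a wide table padded with NaN, and a long Series with one value per row.

```python
import pandas as pd

# Hypothetical stand-in for df[20]: each cell holds a comma-separated
# list of values, and the lists differ in length.
col = pd.Series(["0.1,0.2,0.3", "1.5", "2.0,2.5"])

# Option 1: one column per position. Rows with fewer values are padded
# with missing values, which pd.to_numeric turns into NaN.
wide = col.str.split(",", expand=True).apply(pd.to_numeric)

# Option 2: one row per value. explode() repeats the original row index,
# so each value can still be traced back to the row it came from.
long = pd.to_numeric(col.str.split(",").explode())

print(wide)
print(long)
```

The wide layout is convenient when positions are comparable across rows; the long layout avoids padding altogether and may suit per-row aggregation with `groupby(level=0)`.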