In this Gist, I will keep updating tips for python that I encounter in data analysis, especially for empirical research in finance.
This tip comes from Modern Pandas (Part 4): Performance. When appending a large number of DataFrames read from CSV files, instead of appending one by one as follows:
data = pd.DataFrame([])
for f in files:
    tmpdata = pd.read_csv(f)
    data = data.append(tmpdata)  # append returns a new DataFrame; reassign it
it is more efficient, provided that memory is large enough, to first read all of the files into a list (e.g. with a list comprehension) and then concatenate them in a single call:
dfs = [pd.read_csv(f) for f in files]
data = pd.concat(dfs)
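The pattern can be tried end to end without files on disk; here the "files" are in-memory buffers standing in for CSV files, which is an assumption made only so the sketch is self-contained:

```python
import io
import pandas as pd

# Stand-ins for CSV files on disk: two in-memory buffers with the same columns.
files = [io.StringIO("a,b\n1,2\n3,4\n"),
         io.StringIO("a,b\n5,6\n7,8\n")]

# Read everything into a list first, then concatenate once.
dfs = [pd.read_csv(f) for f in files]
data = pd.concat(dfs, ignore_index=True)

print(data.shape)  # 4 rows, 2 columns
```

`ignore_index=True` rebuilds a clean 0..n-1 index instead of repeating each file's own row numbers.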
I have a DataFrame of basic information (name, symbol, etc.) on several stock indexes. I want to retrieve static data, e.g. the currency, from Datastream, and merge the retrieved data into the original DataFrame, row by row.
The first method is to use a for loop:
RetrievedDf = pd.DataFrame([])
for i in range(len(IdxInfo)):
    res = DWE.fetch(IdxInfo['Symbol'].iloc[i], StaticVars.values.tolist(), static=True)
    RetrievedDf = RetrievedDf.append(res)
IdxInfo = pd.concat([IdxInfo, RetrievedDf.reset_index(drop=True)], axis=1)
where IdxInfo['Symbol'] contains the Datastream identifiers of the stock indexes, and StaticVars contains the variables to retrieve.
The second method is to use pd.DataFrame.apply():
IdxInfo = IdxInfo.apply(lambda df: pd.concat([df, DWE.fetch(df['Symbol'], StaticVars.values.tolist(), static=True).iloc[0]]), axis=1)
which applies a function to each row of the DataFrame; the function retrieves the data from Datastream and concatenates it with the row.
I would expect the second method to be faster. However, for a small DataFrame with 28 rows, the first method (~10 s) beats the second (~11 s). I do not know whether the second would win on a larger sample.
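The two methods can be compared without a Datastream connection by swapping in a stand-in for DWE.fetch; the fetch function and the symbols below are hypothetical, only the loop-then-concat structure mirrors the first method above:

```python
import pandas as pd

def fetch(symbol):
    # Stand-in for DWE.fetch: returns a one-row DataFrame of static data.
    # The real call goes over the network to Datastream.
    return pd.DataFrame({'Currency': ['USD'], 'Name': [symbol + ' Index']})

IdxInfo = pd.DataFrame({'Symbol': ['S&PCOMP', 'FTSE100']})

# Loop and collect the per-symbol results in a list, then concatenate once.
rows = [fetch(s) for s in IdxInfo['Symbol']]
retrieved = pd.concat(rows, ignore_index=True)
merged = pd.concat([IdxInfo, retrieved], axis=1)

print(merged.columns.tolist())  # ['Symbol', 'Currency', 'Name']
```

Note that with a network call per row, the runtime of both methods is dominated by latency, which is likely why the loop and apply() differ by only about a second.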
The package to retrieve data from Datastream is from here.
When we run a long loop with many steps, we typically want the program to display a progress indicator. The following code is easy to implement:
for i in range(10000):
    print('{0}/{1}'.format(i, 10000))
but it has the pitfall that each update is printed on a new line:
0/10000
1/10000
2/10000
Sometimes, we want the progress indicator to stay on one line and keep updating. In other words, each new output should overwrite the previous one, which can be implemented as follows:
import sys
for i in range(10000):
    sys.stdout.write('\r{0}/{1}'.format(i, 10000))
    sys.stdout.flush()  # force the update to appear without a newline
where, in each step, \r moves the cursor to the start of the line so that the new output overwrites the previous one.
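A runnable version of this pattern, with the formatting pulled into a small helper and a shorter loop so the effect is easy to watch (the sleep only slows the loop down for demonstration):

```python
import sys
import time

def progress(i, total):
    # '\r' moves the cursor back to the start of the line, so writing
    # this string overwrites the previous count instead of adding a line.
    return '\r{0}/{1}'.format(i, total)

total = 50
for i in range(total):
    sys.stdout.write(progress(i + 1, total))
    sys.stdout.flush()  # needed because stdout buffers until a newline
    time.sleep(0.005)
sys.stdout.write('\n')  # finish with a newline so the shell prompt is clean
```

Without the flush() call, stdout is typically line-buffered, so nothing would appear until the loop ends.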