Although the mainstream packages for empirical research in finance/economics are SAS, MATLAB, and Stata, I prefer Python. There are not many up-to-date documents and tutorials on Python applied to empirical research in finance/economics. Professor Kevin Sheppard's tutorial on Python for econometrics is an excellent start, but its lecture notes have not kept pace with the rapid growth of Python packages. For example, some methods in the lecture notes, such as DataFrame.ix and Panel, will soon be deprecated according to the new pandas documentation. Meanwhile, the pandas documentation is voluminous, which makes it hard for researchers to extract the essential information quickly.
Therefore, I would like to create this file to keep a record of important objects, methods, and properties that I frequently encounter in my empirical research. It can be regarded as a cheat sheet for researchers who already have a basic knowledge of pandas and want to apply it to their empirical research immediately.
The first step in manipulating data sets is to read data from files on the disk. Since reading and writing data sets are implemented with similar methods, I put them together in this subsection.
Pandas supports reading from and writing to different formats of data files, including those commonly used in finance: .txt, .csv, .xlsx, .dta, .sas7bdat, etc. A complete introduction can be found here. For me, the most commonly used format is CSV, a comma-delimited plain text file that records data in a spreadsheet form, because of its flexibility and compatibility with different software packages. Almost all software packages can read CSV files, whereas other formats, e.g. DTA files for Stata, can only be read directly by the corresponding software, i.e. Stata.
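For the other formats, pandas provides analogous reader functions. As a minimal sketch (the file names below are placeholders):

import pandas as pd

# Each reader returns a DataFrame; file names are placeholders.
df_xlsx = pd.read_excel('data.xlsx')     # Excel workbook, first sheet by default
df_dta = pd.read_stata('data.dta')       # Stata data set
df_sas = pd.read_sas('data.sas7bdat')    # SAS data set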
The following code reads from a CSV file named data.csv and returns a DataFrame object named df:
import pandas as pd

df = pd.read_csv('data.csv')
If you check the reference for the method pandas.read_csv(), you will find a lot of parameters that can customize your reading. However, I would suggest reading the whole data set first and then doing some filtering or selection, especially when you are not 100% sure about the details of the data, unless the data set is too large to be read into memory at once.
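A minimal sketch of this read-then-filter workflow (the column names year, permno, and ret are hypothetical, as is the chunk size used for the out-of-memory case):

import pandas as pd

# Read everything first, then filter in memory; column names are hypothetical.
df = pd.read_csv('data.csv')
df = df[df['year'] >= 2000]             # keep only the rows you need
df = df[['permno', 'year', 'ret']]      # keep only the columns you need

# If the file is too large for memory, read and filter it chunk by chunk instead.
chunks = []
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    chunks.append(chunk[chunk['year'] >= 2000])
df = pd.concat(chunks, ignore_index=True)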
Writing a DataFrame object in memory to a CSV file on the disk is also simple:
df.to_csv('data.csv', index=False)
which writes df to the CSV file data.csv. Note that if you do not assign False to index, the index will be automatically written to the data file. This is unnecessary if the index of the DataFrame is not indicative, e.g. the default nonnegative integers. Otherwise, the resulting data file will carry an extra column, which appears as Unnamed: 0 when read back into pandas if the index does not have a name.
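A small sketch of the difference, with placeholder file names and a toy DataFrame:

import pandas as pd

df = pd.DataFrame({'ret': [0.01, -0.02, 0.03]})

# Without index=False, the default integer index is written as an extra column.
df.to_csv('with_index.csv')
print(pd.read_csv('with_index.csv').columns.tolist())     # ['Unnamed: 0', 'ret']

# With index=False, only the data columns are written.
df.to_csv('without_index.csv', index=False)
print(pd.read_csv('without_index.csv').columns.tolist())  # ['ret']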