A friend recently came to me with a data management problem. He has a stack of Excel spreadsheets that need to be turned into an API, which means that the data is going to have to be dumped out of those spreadsheets and inserted into a database. I haven't seen the data yet, so I can't make a recommendation on how we should tackle it - but I have some ideas.
Microsoft Excel actually provides a really nice integration with Access, and Access lets you set up data extracts that can be loaded into your data storage mechanism of choice. This is easily the most point-and-click way to approach the task if you're of the GUI persuasion: the "External Data" tab of your Access database has everything you need to import and export your data. Given the quality of the wizards available to walk you through it, I won't be discussing this method much.
If the data is fairly straightforward text-based information, you can use Python's csv library to start converting it into something usable. This requires the data to be flat (i.e., only one worksheet), which might prove to be a problem if you've received a standard 20-sheet Excel behemoth. Each of those worksheets would have to be saved as a .csv file and processed individually before a final merge, which may not be the most efficient approach. Still, if you've only got a sheet or two to deal with, the csv library is a quick and effective way to produce a dataset. I'm assuming you're using Python 2.x here.
import csv

data_set = []
with open('file.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        data_set.append(row)
print data_set
# [['field1','field2','field3','field4'],['field5','field6','field7','field8'],['field9','field10','field11','field12']]
What have we done here? We've created a list of lists, where each top-level item is itself a list of the values from one row. This is certainly not the only way to approach the task. Perhaps I actually want a dictionary, with each entry keyed by its row number:
import csv

data_set = {}
row_count = 1
with open('file.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        data_set[row_count] = row
        row_count += 1
print data_set
Maybe the first row entry is actually the primary key, so we need to establish that in the dictionary:
import csv

data_set = {}
with open('file.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    for row in csvreader:
        data_set[row[0]] = row[1:]
print data_set
What if our first row is actually headers?
import csv

data_set = []
with open('file.csv', 'rb') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',')
    headers = csvreader.next()
    for row in csvreader:
        entry = {}
        field_count = 0
        for field in row:
            entry[headers[field_count]] = field
            field_count += 1
        data_set.append(entry)
print data_set
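Incidentally, the standard library will do that header bookkeeping for you: csv.DictReader treats the first row as field names by default, so each row comes back as a dictionary with no index-counting on your part. A minimal sketch of the same result:

import csv

data_set = []
with open('file.csv', 'rb') as csvfile:
    # DictReader reads the header row itself and keys each row's
    # values by those header names.
    for row in csv.DictReader(csvfile):
        data_set.append(row)
print data_set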
I'm not going to tell you how to cut your data. That one's up to you, based on how you need it structured for further manipulation.
xlrd is a nifty little Python module that I've used when I have larger Excel datasets. A quick

pip install xlrd

ought to do the trick for you if you're pure of heart and are pipping all over the place. If you're not using pip, fix that first.
Once you have xlrd installed, you can start to work with the various workbooks:
from xlrd import open_workbook

book = open_workbook('file.xls')
data_set = {}
for sheet in book.sheets():
    sheet_data = []
    for row in range(sheet.nrows):
        values = []
        for col in range(sheet.ncols):
            values.append(sheet.cell(row, col).value)
        sheet_data.append(values)
    data_set[sheet.name] = sheet_data
What have we done here? We've established a dictionary where each key is a worksheet name and each value is the data from that sheet. You can certainly cut your data any way you like, following the formatting patterns established in the CSV discussion; those patterns work equally well once xlrd has read the data in.
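For example, if each worksheet carries its own header row, you can replay the header-dictionary pattern from the CSV section. A sketch, assuming row 0 of every sheet really does hold column headers (empty or headerless sheets would need special-casing):

from xlrd import open_workbook

book = open_workbook('file.xls')
data_set = {}
for sheet in book.sheets():
    # Assumption: row 0 holds the column headers for this sheet.
    headers = [sheet.cell(0, col).value for col in range(sheet.ncols)]
    sheet_data = []
    for row in range(1, sheet.nrows):
        values = [sheet.cell(row, col).value for col in range(sheet.ncols)]
        sheet_data.append(dict(zip(headers, values)))
    data_set[sheet.name] = sheet_data
print data_set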
From here, you've got a lot of options for playing with your data. Serve it up as an API with Flask and MongoDB if you like. Throw it into pandas and massage it on the fly. Print it and take a look at it in the console. Just follow your heart.
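Speaking of pandas: if that's where you're headed, it can handle the Excel reading for you in one line. A minimal sketch, assuming you've got pandas installed (it leans on xlrd under the hood to parse .xls files):

import pandas as pd

# Reads the first worksheet into a DataFrame, using the sheet's
# first row as column headers.
df = pd.read_excel('file.xls')
print df.head()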