SQLite is a C library that provides a lightweight, disk-based database that doesn't require a separate server process. Applications can use SQLite for internal data storage, and it's also possible to prototype an application with SQLite and later migrate to a more fully-featured RDBMS.
SQLite databases are stored in a file on disk (usually with a .db extension). If you attempt to connect to a database file that doesn't exist, SQLite will create a new database, assign it the name passed to the connect function, and save it to your current working directory.
Non-Pandas SQLite usage typically follows these steps:
- Create a connection object
- Define a cursor
- Construct a query for the dataset of interest
- Pass the query string to the cursor's execute method
- Iterate over the cursor's result set
import sqlite3

db = sqlite3.connect("<filename>.db")
cursor = db.cursor()
SQL = "SELECT * FROM SAMPLE_TABLE"
cursor.execute(SQL)

# Iterate over the cursor to produce queried records.
for record in cursor:
    print(record)
The result set is a list of tuples, so data elements can be accessed row by row or selectively by referencing individual components by index offset.
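For illustration, here is a minimal sketch of indexing into the result set. The table name and values are placeholders, and ":memory:" creates a transient in-memory database convenient for demonstrations:

```python
import sqlite3

# ":memory:" creates a transient in-memory database, handy for demonstrations.
db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE SAMPLE_TABLE (TICKER TEXT, CLOSE REAL)")
cursor.execute("INSERT INTO SAMPLE_TABLE VALUES ('GE', 32.06)")
cursor.execute("SELECT * FROM SAMPLE_TABLE")

records = cursor.fetchall()   # list of tuples
first = records[0]            # first row: ('GE', 32.06)
close = first[1]              # column by index offset: 32.06
db.close()
```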
If the database file passed to sqlite3.connect doesn't exist, a new database with the specified name will be created. The following example creates a database consisting of two tables: the first holds closing stock prices, and the second contains a mapping between ticker symbols and company names. (For more information on SQLite datatypes and the affinity mappings of common datatypes from other RDBMS, check out this page from the official SQLite documentation.)
"""
Creating a new database with two tables using sqlite3.
=============================
Table 1 |
=============================
TABLENAME: `CLOSING_PRICES` |
|
FIELDS : DATE TEXT |
TICKER TEXT |
CLOSE REAL |
=============================
=============================
Table 2 |
=============================
TABLENAME: `TICKER_MAPPING` |
|
FIELDS : TICKER TEXT|
COMPANY NAME TEXT|
=============================
"""
import sqlite3
# Create new database `sample.db`. Notice `sample.db` is now
# listed in your working directory.
db = sqlite3.connect("sample.db")
# Initialize a cursor by calling the connection's cursor method.
cursor = db.cursor()
# Specify the DDL to create the two tables:
tbl1_ddl = """CREATE TABLE CLOSING_PRICES (
DATE TEXT,
TICKER TEXT,
CLOSE REAL)"""
tbl2_ddl = """CREATE TABLE TICKER_MAPPING (
TICKER TEXT,
COMPANY_NAME TEXT)"""
# Call the `cursor.execute` method, passing tbl1_ddl & tbl2_ddl as arguments.
cursor.execute(tbl1_ddl)
cursor.execute(tbl2_ddl)
# IMPORTANT! Be sure to commit changes you want to persist. Without
# committing, changes will not be saved.
db.commit()
# Close connection to `sample.db`.
db.close()
To verify that the tables have been created, run the following:
db = sqlite3.connect('sample.db')
cursor = db.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
db.close()
Next we present various methods for loading data into SQLite tables without Pandas. The first method loads data record by record. In this example, we insert values into the TICKER_MAPPING table by calling cursor.execute:
"""
Method #1: Load data record by record.
"""
db = sqlite3.connect('sample.db')
cursor = db.cursor()
cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('AXP', 'American Express Company')")
cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('GE' , 'General Electric Company')")
cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('GS' , 'Goldman Sachs Group Inc')")
cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('UTX' , 'United Technologies Corporation')")
We can populate the CLOSING_PRICES table from a list of tuples in a single call to cursor.executemany:
closing_prices = [
('20160722', 'AXP', 64.28), ('20160722', 'GE' , 32.06),
('20160722', 'GS' , 160.41), ('20160722', 'UTX', 105.13)
]
cursor.executemany("INSERT INTO CLOSING_PRICES VALUES (?,?,?)", closing_prices)
The (?,?,?) in the call to cursor.executemany serves as a set of placeholders for the columns in the target table. There should be one ? for each column in the target table.
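The same placeholder syntax works for a single record passed to cursor.execute. A small sketch, using a transient in-memory database rather than sample.db:

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE CLOSING_PRICES (DATE TEXT, TICKER TEXT, CLOSE REAL)")

# One `?` per column; values are bound from the tuple in order.
row = ('20160722', 'AXP', 64.28)
cursor.execute("INSERT INTO CLOSING_PRICES VALUES (?,?,?)", row)
db.commit()

cursor.execute("SELECT COUNT(*) FROM CLOSING_PRICES")
n = cursor.fetchone()[0]
db.close()
```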
The more common scenario is loading data from a file into a SQLite database. In what follows, we load closing price data from the dataset referenced below into the CLOSING_PRICES table:
"""
Loading data from .csv file into CLOSING_PRICES table. Download dataset from here:
https://gist.github.com/jtrive84/52fcbb7a8637d72dbc49b3fdeb1edbee
and name it `ticker_data.csv` in the current working directory.
"""
import csv

try:
    db = sqlite3.connect('sample.db')
    cursor = db.cursor()
    with open('ticker_data.csv', 'r') as f:
        fcsv = csv.reader(f)
        # `recs_to_load` is a list of records contained in `ticker_data.csv`.
        recs_to_load = [record for record in fcsv]
    # Call `cursor.executemany`, specifying `recs_to_load`.
    cursor.executemany("INSERT INTO CLOSING_PRICES VALUES (?,?,?)", recs_to_load)
finally:
    db.commit()
    db.close()
To retrieve SQLite database records, an iterator in the form of a database cursor is returned, which is traversed to obtain the returned dataset elements:
import sqlite3

try:
    db = sqlite3.connect('sample.db')
    cursor = db.cursor()
    # Query to select all records from `CLOSING_PRICES`:
    SQL = "SELECT * FROM CLOSING_PRICES"
    cursor.execute(SQL)
    # Iterate over cursor to print records.
    for rec in cursor:
        print(rec)
finally:
    db.close()
Headers need to be extracted from the cursor.description attribute:
headers = [ii[0] for ii in cursor.description]
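Zipping the headers with each record yields dict-style access by field name. A sketch assuming the CLOSING_PRICES schema from earlier, demonstrated with an in-memory database:

```python
import sqlite3

db = sqlite3.connect(":memory:")
cursor = db.cursor()
cursor.execute("CREATE TABLE CLOSING_PRICES (DATE TEXT, TICKER TEXT, CLOSE REAL)")
cursor.execute("INSERT INTO CLOSING_PRICES VALUES ('20160722', 'AXP', 64.28)")
cursor.execute("SELECT * FROM CLOSING_PRICES")

# Field names are at position 0 of each `cursor.description` entry.
headers = [ii[0] for ii in cursor.description]
records = [dict(zip(headers, rec)) for rec in cursor]
db.close()
```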
The following is an example that uses bind variables with SQLite:
import sqlite3

try:
    db = sqlite3.connect('sample.db')
    cursor = db.cursor()
    # Bind variable key-value pairs.
    params = {'symbol': 'GE', 'date': '20161125'}
    SQL = "SELECT * FROM CLOSING_PRICES WHERE TICKER=:symbol AND DATE!=:date"
    cursor.execute(SQL, params)
    headers = [i[0] for i in cursor.description]
finally:
    db.close()
Working with SQLite from Pandas is straightforward. Connections are established just as before, but the need to work directly with cursor objects is eliminated. Instead, we access a set of general DataFrame methods to read from and write to SQLite databases.
When using Pandas, it isn't necessary to specify the DDL associated with a table. Instead, datatypes are inferred from each column's dtype. In what follows, we create two DataFrames corresponding to the CLOSING_PRICES and TICKER_MAPPING tables:
import sqlite3
import pandas as pd
# Load closing price data from GitHub directly.
dfprices = pd.read_csv("https://gist.github.com/jtrive84/52fcbb7a8637d72dbc49b3fdeb1edbee/raw/8e53c2f5ed356319afd6fb6641dd1347247af409/ticker_data.csv")

dfmapping = pd.DataFrame({
    'TICKER': ['AXP', 'GE', 'GS', 'UTX'],
    'COMPANY_NAME': [
        'American Express Company',
        'General Electric Company',
        'Goldman Sachs Group Inc',
        'United Technologies Corporation',
    ],
})
# Load dfmapping and dfprices into sample.db.
try:
    conn = sqlite3.connect("sample.db")
    dfprices.to_sql("CLOSING_PRICES", conn, if_exists="replace", index=False)
    dfmapping.to_sql("TICKER_MAPPING", conn, if_exists="replace", index=False)
finally:
    conn.close()
If index=False is not included in the call to to_sql, the DataFrame's index will be present as the left-most column of the exported table. We set if_exists to "replace". Other options are "fail", if the table should not be overwritten, or "append", to combine new data with existing records.
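A sketch of "replace" followed by "append", using an in-memory database and made-up single-row DataFrames:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

df1 = pd.DataFrame({'TICKER': ['AXP'], 'COMPANY_NAME': ['American Express Company']})
df2 = pd.DataFrame({'TICKER': ['GE'], 'COMPANY_NAME': ['General Electric Company']})

# First write creates (or replaces) the table; second write appends to it.
df1.to_sql("TICKER_MAPPING", conn, if_exists="replace", index=False)
df2.to_sql("TICKER_MAPPING", conn, if_exists="append", index=False)

dfall = pd.read_sql("SELECT * FROM TICKER_MAPPING", conn)
conn.close()
```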
To verify that the tables have been created, run the following:
In [1]: pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn)
Out[1]:
             name
0  CLOSING_PRICES
1  TICKER_MAPPING
Querying data from SQLite is similarly straightforward. Let's retrieve all records from the CLOSING_PRICES table in which CLOSE is greater than 30:
"""
Querying SQLite data from Pandas.
"""
import sqlite3
import pandas as pd
# Retrieve records where CLOSE > 30.
In [1]: conn = sqlite3.connect("sample.db")
In [2]: SQL = "SELECT * FROM CLOSING_PRICES WHERE CLOSE>30"
In [3]: dfprices30 = pd.read_sql(SQL, conn)
In [4]: dfprices30
Out[4]:
        DATE TICKER      CLOSE
0   20161125     GE  31.440001
1   20161123     GE  31.340000
2   20161122     GE  31.180000
3   20161121     GE  30.870001
4   20161118     GE  30.670000
5   20161117     GE  30.790001
6   20161116     GE  30.740000
7   20161115     GE  30.750000
8   20161114     GE  30.510000
9   20161111     GE  30.709999
10  20161110     GE  30.410000
Although not applicable for these tables, there may be datasets that are too large to retrieve in a single transaction. For such tables, we can supply read_sql with a chunksize argument specifying the number of records to retrieve in each chunk. When chunksize is provided, the result is no longer a DataFrame but an iterator, and the results can be processed as such. In what follows, CLOSING_PRICES is read into a list of DataFrames of at most 5 records each, which are then concatenated into a single DataFrame via pd.concat:
import sqlite3
import pandas as pd

try:
    conn = sqlite3.connect("sample.db")
    SQL = "SELECT * FROM CLOSING_PRICES"
    dfiter = pd.read_sql(SQL, conn, chunksize=5)
    df_list = [df for df in dfiter]
    dfprices = pd.concat(df_list).reset_index(drop=True)
finally:
    conn.close()
Each element of df_list is a DataFrame of at most 5 records, and pd.concat combines them into a single DataFrame.
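When even the concatenated result would be too large to hold in memory, each chunk can instead be processed as it arrives. A sketch with made-up closing prices in an in-memory database, accumulating a running maximum across chunks:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
pd.DataFrame({
    'DATE': ['20160722'] * 4,
    'TICKER': ['AXP', 'GE', 'GS', 'UTX'],
    'CLOSE': [64.28, 32.06, 160.41, 105.13],
}).to_sql("CLOSING_PRICES", conn, if_exists="replace", index=False)

# Track a running maximum across chunks instead of materializing all rows.
running_max = float("-inf")
for chunk in pd.read_sql("SELECT * FROM CLOSING_PRICES", conn, chunksize=2):
    running_max = max(running_max, chunk["CLOSE"].max())
conn.close()
```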
It can be useful to interact with SQLite databases from a graphical interface, especially as the number of database tables grows. One such tool is SQLiteStudio, a versatile SQLite IDE that includes the tools necessary to manage databases, schemas, tables and related objects.