Skip to content

Instantly share code, notes, and snippets.

@jtrive84
Last active July 22, 2024 08:55
Show Gist options
  • Save jtrive84/38faee556a5106c4be9c38635f3ec2e5 to your computer and use it in GitHub Desktop.
Save jtrive84/38faee556a5106c4be9c38635f3ec2e5 to your computer and use it in GitHub Desktop.
Introduction to Python's sqlite3 library and typical usage scenarios.

Introduction to SQLite in Python

Relevant Links:


SQLite is a C library that provides a lightweight disk-based database that doesn't require a separate server process. Applications can use SQLite for internal data storage. It's also possible to prototype an application using SQLite then later migrate to a more fully-featured RDBMS. SQLite databases are stored in a file on disk (usually with a .db extension). If you attempt to connect to a database file that doesn't exist, SQLite with create a new database, assign it the name you passed to the connect function and save it to your current working directory.

Non-Pandas SQLite usage typically follows:

  • Create a connection object
  • Define a cursor
  • Construct a query for the dataset of interest
  • Pass the query string to the cursor's execute method
  • Iterate over the cursor's result set
import sqlite3

db = sqlite3.connect(<filename>.db)
cursor = db.cursor()  
SQL = "SELECT * FROM SAMPLE_TABLE"  
cursor.execute(SQL) 
 
# Iterate over cursor to produce queried records.
for record in cursor: 
    print(record)

The result will be a list of tuples, so data elements can be accessed by row or selectively by referencing components by index offset.

Creating Datebases and Tables with sqlite3

If the database file passed into sqlite3.connect doesn't exist, a new database with the name specified will be created. The following example creates a database consisting of 2 tables: The first table holds closing stock prices, the second contains a mapping between ticker symbols and company names (for more information on SQLite datatypes and the resulting affinity mappings of common datatypes for other RDBMS, checkout this page from the SQLite official documentation):

"""
Creating a new database with two tables using sqlite3.

=============================
Table 1                     |
=============================
TABLENAME: `CLOSING_PRICES` |
                            |
FIELDS   : DATE   TEXT      |
           TICKER TEXT      |
           CLOSE  REAL      |
=============================

=============================
Table 2                     |
=============================
TABLENAME: `TICKER_MAPPING` |
                            |
FIELDS   : TICKER       TEXT|
           COMPANY NAME TEXT|     
=============================
"""
import sqlite3

# Create new database `sample.db`. Notice `sample.db` is now 
# listed in your working directory.
db = sqlite3.connect("sample.db")

# Initiate a cursor, and call the connection's cursor method.
cursor = db.cursor()

# Specify the DDL to create the two tables:
tbl1_ddl = """CREATE TABLE CLOSING_PRICES (
              DATE   TEXT,
              TICKER TEXT,
              CLOSE  REAL)"""

tbl2_ddl = """CREATE TABLE TICKER_MAPPING (
              TICKER       TEXT,
              COMPANY_NAME TEXT)"""

# Call the `cursor.execute` method, passing tbl1_ddl & tbl2_ddl as arguments.
cursor.execute(tbl1_ddl)
cursor.execute(tbl2_ddl)

# IMPORTANT! Be sure to commit changes you want to persist. Without
# commiting, changes will not be saved.
db.commit()

# Close connection to `sample.db`.
db.close()

To verify that the tables have been created, run the following:

db = sqlite3.connect('sample.db')
cursor = db.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
db.close()

Loading Data into SQLite Database Tables (without Pandas)

Next we present various methods for loading data into SQLite tables without Pandas. The first method loads the data record by record. In this example, we load values into the TICKER_MAPPING table calling cursor.execute:

"""
Method #1: Load data record by record. 
"""
db = sqlite3.connect('sample.db')
cursor = db.cursor()

cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('AXP',  'American Express Company')")
cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('GE' ,  'General Electric Company')")
cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('GS' ,  'Goldman Sachs Group Inc')")
cursor.execute("INSERT INTO TICKER_MAPPING VALUES ('UTX' , 'United Technologies Corporation')")

We can populate the CLOSING_PRICES table using a list of tuples in a single call to cursor.executemany:

closing_prices = [
    ('20160722', 'AXP',  64.28), ('20160722', 'GE' ,  32.06),
    ('20160722', 'GS' , 160.41), ('20160722', 'UTX', 105.13)
    ]
    
cursor.executemany("INSERT INTO CLOSING_PRICES VALUES (?,?,?)", closing_prices)

The (?,?,?) in cursor.executemany serve as placeholders for columns in the target table. There should be one ? for each column in the target table.

The more common scenario is loading data from file into a SQLite database. In what follows, we load closing price data from the dataset available here into the CLOSING_PRICES table:

"""
Loading data from .csv file into CLOSING_PRICES table. Download dataset from here:

         https://gist.github.com/jtrive84/52fcbb7a8637d72dbc49b3fdeb1edbee
         
and name it `ticker_data.csv` in the current working directory. 
"""
import csv

try:
    db = sqlite3.connect('sample.db')
    cursor = db.cursor()
    
    with open('ticker_data.csv', 'r') as ff:
        fcsv = csv.reader(f)
    
        # `recs_to_load` is a list of records contained in `ticker_data.csv`.
        recs_to_load = [record for record in fcsv]

        # Call `cursor.executemany`, specifying `recs_to_load`.
        cursor.executemany("INSERT INTO CLOSING_PRICES VALUES (?,?,?)", recs_to_load)
finally:
    db.commit()
    db.close()

Retrieving Data from SQLite without Pandas

To retrieve SQLite database records, an iterator in the form of a database cursor is returned, which is traversed to obtain to returned dataset elements:

import sqlite3

try:
    db = sqlite3.connect('sample.db')
    cursor = db.cursor()

    # Query to select all records from `CLOSING_PRICES`:
    SQL = "SELECT * FROM CLOSING_PRICES"

    cursor.execute(SQL)

    # Iterate over cursor to print records. 
    for rec in cursor:
        print(rec)
        
 finally:
    db.close()

Headers need to be extracted from the cursor.description attribute:

headers = [ii[0] for ii in cursor.description]

Using Bind Variables

The following is an example that uses bind variables with SQLite:

import sqlite3

try:
    db = sqlite3.connect('sample.db')
    cursor = db.cursor()

	# Bind variable key-value pairs.
    params = {'symbol':'GE', 'date':'20161125'}
 	SQL = "SELECT * FROM CLOSING_PRICES WHERE TICKER=:symbol AND DATE!=:date"
	cursor.execute(SQL, params)
	headers = [i[0] for i in cursor.description]
finally:
	db.close()

Working with SQLite from Pandas

Working with SQLite from Pandas is straightforward. Connections are established just as before, but the need to work directly with cursor objects is eliminated. Instead, we access a set of general DataFrame methods to read from and write to SQLite databases.

Loading Data in SQLite from Pandas

When using Pandas, it isn't necessary to specify the DDL associated with a table. Instead, datatypes are inferred from each column's dtype. In what follows, we create two DataFrames corresponding the CLOSING_PRICES and TICKER_MAPPING:

import sqlite3
import pandas as pd

# Load close data dfrom GitHub directly.
dfprices = pd.read_csv("https://gist.github.com/jtrive84/52fcbb7a8637d72dbc49b3fdeb1edbee/raw/8e53c2f5ed356319afd6fb6641dd1347247af409/ticker_data.csv")

dfmapping = pd.DataFrame({
    'AXP':['American Express Company'],
    'GE':['General Electric Company'],
    'GS':['Goldman Sachs Group Inc'],
    'UTX':['United Technologies Corporation'],
    })
	
# Load dfmapping and dfprices into sample.db.
try:
    conn = sqlite3.connect("sample.db")
    dfprices.to_sql("CLOSING_PRICES", conn, if_exists="replace", index=False)
    dfmapping.to_sql("TICKER_MAPPING", conn, if_exists="replace", index=False)
finally:
    conn.close()

If index=False is not included in to_sql, the DataFrame's index will be present as the left-most column in the exported table.

We set if_exists to "replace". Other options are "fail" if the table should not be overwritten, or "append" to combine new data with existing records.

To verify that the tables have been created, run the following:

In [1]: SQL = "SELECT * FROM TABLENAME"
In [2]: pd.read_sql("SELECT name FROM sqlite_master WHERE type='table'", conn)
Out[1]:
   	     name
0  CLOSING_PRICES
1  TICKER_MAPPING

Querying Data into SQLite from Pandas

Querying data from SQLite is similarly straightforward. Let's retrieve all records from the CLOSING_PRICES table in which CLOSE is greater than 30:

"""
Querying SQLite data from Pandas.
"""
import sqlite3
import pandas as pd

# Retrieve records where CLOSE > 30.
In [1]: conn = sqlite3.connect("sample.db")
In [2]: SQL = "SELECT * FROM CLOSING_PRICES WHERE CLOSE>30"
In [3]: dfprices30 = pd.read_sql(SQL, conn)
In [4]: dfprices30
Out[1]:
	DATE TICKER      CLOSE
0   20161125     GE  31.440001
1   20161123     GE  31.340000
2   20161122     GE  31.180000
3   20161121     GE  30.870001
4   20161118     GE  30.670000
5   20161117     GE  30.790001
6   20161116     GE  30.740000
7   20161115     GE  30.750000
8   20161114     GE  30.510000
9   20161111     GE  30.709999
10  20161110     GE  30.410000

Although not applicable for these tables, there may be datasets that are too large to retrieve in a single transaction. For such tables, we can supply read_sql with a chunksize argument, specifying the number of records to retrieve in each chunk. When chunksize is provided, the result is no longer a DataFrame but an iterator, and the results can be processed as such.

In what follows, CLOSING_PRICES is read into a list of at most 5 record DataFrames, which are then concatenated into a single DataFrame via pd.concat:

import sqlite3
import pandas as pd

try:
    conn = sqlite3.connect("sample.db")
    SQL = "SELECT * FROM CLOSING_PRICES"
    dfiter = pd.read_sql(SQL, conn, chunksize=5)
    df_list = [df for df in dfiter]	
    dfprices = pd.concat(df_list).reset_index(drop=True)
finally:
    conn.close()

df_list is a list of 5 DataFrames, which are concatenated into a single DataFrame via pd.concat.

SQLiteStudio

It can be useful to interact with SQLite databases from a graphical interface, especially as the number of database tables grows. One such tool is SQLiteStudio, a versatile SQLite IDE that includes the tools necessary to manage databases, schemas, tables and related objects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment