NoSQL databases have seen a surge in popularity due to their ability to handle large volumes of unstructured data. Learning to leverage NoSQL tools such as column-oriented, document, key-value, and graph databases allows data engineers, analysts, and scientists to interact with a wider breadth of data. These NoSQL databases bring additional functionality to data storage and retrieval that isn’t available in traditional relational databases.
By Jake Roach, Senior AI Engineer
Basics of NoSQL databases. Identify the pros and cons of using column-oriented, document, key-value, and graph databases to make data available to downstream processes and data consumers.
- Traditional relational data stores (RDBMS)
- Organize data in tables, using columns and rows
- Leverage SQL to manage and manipulate data
- Enforce integrity through constraints on databases and tables
- NoSQL stands for
not only SQL
- Allows for a wider range of data to be persisted and accessed
- More flexible with regards to schema, captures data as it changes
- Better scaling and performance
- NoSQL data stores
- Column-oriented -
Snowflake
- Document -
Postgress
- Key-value -
Redis
- Graph
- Column-oriented -
import snowflake.connector
conn = snowflake.connector.connect(
user="<user>",
password="<password>",
account="<account_identifier>",
database="<database_name>",
schema="<schema_name>",
warehouse="<warehouse_name>"
)
# Build a query in a string (or multi-line string)
query = """
SELECT title, price
FROM books
WHERE price < 50.00;
"""
# Execute the query, print the results
results = conn.cursor().execute(query).fetch_pandas_all()
print(results)
import sqlalchemy
# Create a connection string, and an engine
connection_string = "postgresql+psycopg2://<user>:<password>@<host>:<port>/<database>"
db_engine = sqlalchemy.create_engine(connection_string)
import pandas as pd
# Build the query
query = """
SELECT
books -> 'title' AS title,
books -> 'price' AS price
FROM data_science_resources;
"""
# Execute the query
result = pd.read_sql(query, db_engine)
print(result)
Snowflake tables, fundamentals of micro-partitioning and data clustering to optimize query performance with query pruning. Common table expressions, materialized and non-materialized views to streamline analytics workflows. Semi-structured data with Snowflake's VARIANT types.
- Row-oriented vs. column-oriented databases
- Use for analytics workflows
- Perform well when loading, updating, or deleting data in-bulk
COPY INTO books
FROM 'file://data_science_books.csv'
FILE_FORMAT = (
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
);
CREATE OR REPLACE TABLE premium_books AS
SELECT * FROM books
WHERE price > 50.00;
- Micro-partitioning data with Snowflake
- Creates smaller "chunks" of rows, stored in columnar format
- Stores metadata about each partition
- Query pruning to reduce the amount of data accessed
- Efficient execution of DML (data manipulation language)
- Data clustering with Snowflake
- Organizing or grouping similar data points together
- Automatically performed during data load
- Decreasing data accessed during execution
- Improved query performance
- Common table expressions (CTEs)
- Named sub-queries/temporary tables, defined using the WITH keyword
- Creates a object that can be later queried
- Reduce the amount of data that is being queried and/or JOIN'ed
- More modular, easier to troubleshoot
- Views with Snowflake
WITH premium_books AS (
SELECT title, author, avg_reviews
FROM books
WHERE price > 25.00
)
SELECT author, MIN(avg_reviews) AS min_avg_reviews, MAX(avg_reviews) AS max_avg_reviews
FROM premium_books
GROUP BY author;
- Semi-structured data in Snowflake
OBJECT
is similar to dictionaries in PythonARRAY
is similar to lists in PythonVARIANT
type stores semi-structured data in a single column- Bracket and dot notations
Postgres JSON, and intricacies of semi-structured data. Extract and transform top-level and nested document data using built-in operators and functions.
- JSON and JSONB in Postgres
- Turning tabular data into JSON format
- Extracting keys from JSON
- Querying JSON data with Postgres
->
operator takes a key, returns field as JSON->>
operator takes a key, returns field as text
- Finding the type of data store in JSON objects
INSERT INTO students (school, age, address, parent_meta) VALUES (
'GP', 18, 'U', '{\"guardian\": \"mother\", ... \"P2\": \"at_home\"}}'
);
COPY students FROM 'students.csv' DELIMITER ',' CSV, HEADER;
SELECT row_to_json(row(
school, age, address
))
FROM students;
SELECT json_object_keys(parent_meta)
FROM students;
SELECT
parent_meta -> 'guardian' AS guardian
parent_meta ->> 'status' AS status
FROM student;
SELECT
parent_meta -> 'jobs' ->> 'P1' AS jobs_P1,
parent_meta -> 'jobs' ->> 'P2' AS jobs_P2
FROM student;
SELECT
parent_meta -> 'educations' ->> 0
parent_meta -> 'educations' ->> 1
FROM student;
SELECT json_typeof(parent_meta -> 'jobs')
FROM students;
- Advanced Postgres JSON query techniques
#>
and#>>
operatorsjson_extract_path
andjson_extract_path_text
functions
SELECT
parent_meta #> '{jobs}' AS jobs,
parent_meta #> '{jobs, P1}' AS jobs_P1,
parent_meta #> '{jobs, income}' AS income,
parent_meta #>> '{jobs, P2}' AS jobs_P2
FROM student;
SELECT
json_extract_path(parent_meta, 'jobs') AS jobs,
json_extract_path(parent_meta, 'jobs', 'P1') AS jobs_P1,
json_extract_path(parent_meta, 'jobs', 'income') AS income,
json_extract_path_text(parent_meta, 'jobs', 'P2') AS jobs_P2,
FROM student;
Fundamentals of key-value databases. Common use-cases for key-value data, use Redis and Python to read and write data. Basic overview of graph databases.
- A NoSQL database that uses simple key-value pairs (similar to dictionaries in Python) to store data
- Can be searched by key, not the value
- Store values of type string, hash, or lists
- Performant for simple read and writes of data
- Data is stored in memory, instead of on disk
- Commonly used in web applications, for:
- Session management
- Caching frequently accessed data
- Track user preferences and behavior
- Key-value databases source analytics platforms - Snowflake, Redshift
# Import redis, make a connection
r = redis.Redis(...)
# Store a key-value pair
r.set("username", "JDoe")
r.set("age", 27)
r.set("username", "BSmith")
# Retrive the key-value pair
username = r.get("username")
# Store a dictionary using .hset()
r.hset(
"shopping_cart",
mapping={
"item_id": "1003",
"quantity": 2,
"price": 79.99
}
)
# Retrieve the dictionary
r.hgetall("shopping_cart")
- Graph databases
- NoSQL data stores that persist data in a network of nodes and edges
- Each node represents an entity
- Each edge represents a relationship between those entities
- Graph databases are commonly used for
- Social networks
- Recommendation engines
- Fraud detection
- Studying patterns and relationships