- 1.1 Evolution of Data Management
- 1.2 File Systems vs Database Systems
- 1.3 Characteristics of Database Systems
- 1.4 Data Models Overview
- 1.5 Database Users and Roles
- 1.6 Database Architecture (ANSI/SPARC)
- 1.7 Three-Level Architecture
- 1.8 Data Independence
- 1.9 Database System Environment
- 1.10 Overview of Modern DBMS Examples
- Oracle Database
- MySQL
- PostgreSQL
- Microsoft SQL Server
- 2.1 Concept of Data Models
- 2.2 Object-Based Logical Models
- 2.3 Record-Based Logical Models
- 2.4 Physical Data Models
- 2.5 Conceptual vs Logical vs Physical Design
- 2.6 Abstraction and Generalization
- 2.7 Entity-Relationship (ER) Model
- 2.8 Extended ER (EER) Model
- 2.9 UML for Database Modeling
- 2.10 Case Study: Designing a University Database
- 3.1 Structure of Relational Databases
- 3.2 Relations, Tuples, Attributes
- 3.3 Domains and Constraints
- 3.4 Keys (Primary, Candidate, Super, Foreign)
- 3.5 Integrity Constraints
- 3.6 Relational Algebra
- 3.7 Relational Calculus (TRC & DRC)
- 3.8 Views and Relational Model
- 3.9 Null Values and Three-Valued Logic
- 4.1 History and Standardization (ANSI SQL)
- 4.2 Data Definition Language (DDL)
- 4.3 Data Manipulation Language (DML)
- 4.4 Data Control Language (DCL)
- 4.5 Basic Queries (SELECT)
- 4.6 WHERE Clause and Conditions
- 4.7 ORDER BY and GROUP BY
- 4.8 HAVING Clause
- 4.9 Built-in Functions
- 4.10 Constraints in SQL
- 5.1 Subqueries (Correlated & Nested)
- 5.2 Joins (Inner, Outer, Self, Cross)
- 5.3 Set Operations (UNION, INTERSECT, EXCEPT)
- 5.4 Window Functions
- 5.5 Common Table Expressions (CTE)
- 5.6 Recursive Queries
- 5.7 Stored Procedures
- 5.8 Triggers
- 5.9 Index Creation and Usage
- 5.10 Query Performance Analysis
- 6.1 PL/SQL (Oracle Database)
- 6.2 T-SQL (Microsoft SQL Server)
- 6.3 Functions and Packages
- 6.4 Error Handling
- 6.5 Cursors
- 6.6 Dynamic SQL
- 7.1 Definition and Notation
- 7.2 Armstrong’s Axioms
- 7.3 Closure of Attribute Sets
- 7.4 Canonical Cover
- 7.5 Dependency Preservation
- 8.1 Anomalies in Databases
- 8.2 First Normal Form (1NF)
- 8.3 Second Normal Form (2NF)
- 8.4 Third Normal Form (3NF)
- 8.5 Boyce-Codd Normal Form (BCNF)
- 8.6 Fourth Normal Form (4NF)
- 8.7 Fifth Normal Form (5NF)
- 8.8 Multi-Valued Dependencies
- 8.9 Join Dependencies
- 8.10 Denormalization Strategies
- 9.1 Top-Down Design
- 9.2 Bottom-Up Design
- 9.3 View Integration
- 9.4 Schema Refinement
- 9.5 Case Studies (Banking, E-Commerce, Healthcare)
- 10.1 Definition of Transaction
- 10.2 ACID Properties
- 10.3 Transaction States
- 10.4 Schedules and Serializability
- 10.5 Conflict vs View Serializability
- 11.1 Lock-Based Protocols
- 11.2 Two-Phase Locking (2PL)
- 11.3 Deadlocks and Prevention
- 11.4 Timestamp Ordering
- 11.5 Multiversion Concurrency Control (MVCC)
- 11.6 Optimistic Concurrency Control
- 11.7 Isolation Levels (Read Committed, Serializable)
- 12.1 Failure Types
- 12.2 Log-Based Recovery
- 12.3 Checkpointing
- 12.4 Shadow Paging
- 12.5 ARIES Algorithm
- 12.6 Backup and Restore Strategies
- 13.1 File Organization
- 13.2 Heap Files
- 13.3 Sequential Files
- 13.4 Hashing
- 13.5 RAID Architecture
- 14.1 Ordered Indices
- 14.2 B-Tree
- 14.3 B+ Tree
- 14.4 Hash Indexing
- 14.5 Bitmap Indexes
- 14.6 R-Tree (Spatial Index)
- 14.7 Index Selection Strategies
- 15.1 Query Parsing
- 15.2 Query Translation
- 15.3 Logical Query Plan
- 15.4 Physical Query Plan
- 15.5 Join Algorithms
- Nested Loop Join
- Sort-Merge Join
- Hash Join
- 16.1 Cost-Based Optimization
- 16.2 Heuristic Optimization
- 16.3 Statistics and Histograms
- 16.4 Execution Plan Analysis
- 16.5 Distributed Query Optimization
- 17.1 Distributed Database Concepts
- 17.2 Fragmentation (Horizontal/Vertical)
- 17.3 Replication Strategies
- 17.4 Distributed Transactions
- 17.5 Two-Phase Commit (2PC)
- 17.6 Three-Phase Commit (3PC)
- 18.1 Limitations of RDBMS
- 18.2 CAP Theorem
- 18.3 Key-Value Stores
- 18.4 Document Databases
- MongoDB
- 18.5 Column-Family Databases
- Apache Cassandra
- 18.6 Graph Databases
- Neo4j
- 18.7 Comparison: SQL vs NoSQL
- 19.1 Distributed SQL
- 19.2 Horizontal Scalability
- 19.3 Examples
- CockroachDB
- Google Spanner
- 20.1 OLTP vs OLAP
- 20.2 Star Schema
- 20.3 Snowflake Schema
- 20.4 ETL Process
- 20.5 Data Marts
- 20.6 OLAP Operations
- 21.1 Big Data Characteristics
- 21.2 Hadoop Architecture
- Apache Hadoop
- 21.3 MapReduce
- 21.4 Spark
- Apache Spark
- 21.5 Data Lakes
- 22.1 Security Threats
- 22.2 Access Control Models
- 22.3 SQL Injection Attacks
- 22.4 Encryption (TDE)
- 22.5 Auditing and Logging
- 23.1 GDPR Overview
- 23.2 HIPAA
- 23.3 Data Masking
- 23.4 Secure Database Design
- Redis
- SAP HANA
- 25.1 Distributed Ledger Concepts
- 25.2 Smart Contracts
- 25.3 Database vs Blockchain
- 26.1 Vector Databases
- 26.2 Embedding Storage
- 26.3 Similarity Search
- 26.4 AI-Augmented Query Optimization
- 27.1 Query Profiling
- 27.2 Index Optimization
- 27.3 Caching Strategies
- 27.4 Scaling Strategies
- 28.1 CI/CD for Databases
- 28.2 Schema Migration Tools
- 28.3 Backup Automation
- 28.4 Monitoring Tools
- 29.1 Large-Scale Banking Systems
- 29.2 Social Media Architecture
- 29.3 E-Commerce Platforms
- 29.4 High-Frequency Trading Systems
- A. Relational Algebra Formal Proofs
- B. SQL Reference Manual
- C. Sample Database Projects
- D. Interview Questions (Basic → Advanced)
- E. Research Papers and Further Reading
The history of data management is a fascinating journey that parallels the evolution of computing itself. Understanding this evolution provides crucial context for appreciating modern database systems.
Pre-Mechanical Era (Before 1800s) Before computers, data management was entirely manual. Information was recorded on clay tablets, papyrus scrolls, and paper ledgers. Ancient civilizations like the Sumerians maintained grain inventories on clay tablets around 3400 BCE. The Library of Alexandria, established in the 3rd century BCE, represented one of humanity's first large-scale attempts at organized data storage, containing hundreds of thousands of papyrus scrolls cataloged by subject matter.
Mechanical Era (1800s - 1940s) The Industrial Revolution brought mechanical aids to data processing. Herman Hollerith's punched card tabulating machine, developed for the 1890 U.S. Census, could process data ten times faster than manual methods. This invention led to the founding of the Tabulating Machine Company, which eventually became IBM. Punched cards remained a dominant data storage medium until the 1970s, with each card typically holding 80 characters of data.
Early Electronic Era (1940s - 1950s) The first electronic computers, like ENIAC (1945), used vacuum tubes and had minimal data storage capabilities. Programs were wired manually, and data was entered through switches or punched cards. Magnetic tape, introduced in 1951 with the UNIVAC I, allowed sequential data access but required reading entire tapes to find specific records. This period marked the beginning of automated data processing but lacked the random access capabilities we take for granted today.
File-Based Systems (1950s - 1960s) As magnetic disks became available in the late 1950s, file-based systems emerged. Each application maintained its own files, leading to significant data redundancy. A university might have separate student files for the registrar's office, financial aid, and housing—each containing duplicate personal information. This approach worked for simple applications but created numerous problems: data inconsistency when the same information was updated in one file but not others, limited data sharing, and no standardized access methods.
Database Management Systems Emerge (1960s - 1970s) The concept of a centralized, integrated data repository gained traction in the 1960s. Charles Bachman at General Electric developed the Integrated Data Store (IDS), considered the first DBMS, using a network data model. IBM's Information Management System (IMS), developed for the Apollo space program, introduced the hierarchical model. These early systems separated data from applications and provided programmatic access methods.
Relational Revolution (1970 - Present) Edgar F. Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks" fundamentally changed database thinking. Codd proposed organizing data into tables (relations) with mathematical foundations in set theory and predicate logic. This abstraction freed users from understanding physical storage details. IBM's System R prototype (1974-1979) and UC Berkeley's INGRES project demonstrated the relational model's practicality, leading to commercial products like Oracle (1979), DB2 (1983), and later open-source systems like MySQL (1995) and PostgreSQL (1996).
Modern Era (2000s - Present) The internet explosion created new challenges: massive scale, unstructured data, and always-available requirements. Google's Bigtable (2006) and Amazon's Dynamo (2007) papers inspired the NoSQL movement. Today's landscape includes relational databases, document stores, graph databases, time-series databases, and NewSQL systems, each optimized for specific use cases.
Understanding the fundamental differences between file-based approaches and database management systems illuminates why databases became essential.
File System Approach In file-based systems, applications directly create and manipulate operating system files. A payroll application might store employee records in a fixed-length format within a flat file:
Employee Record Format (Fixed Length - 100 bytes):
- Employee ID: 6 bytes (numeric)
- Last Name: 20 bytes (text)
- First Name: 15 bytes (text)
- Department: 4 bytes (numeric)
- Salary: 9 bytes (packed decimal)
- Hire Date: 8 bytes (YYYYMMDD)
- Remaining: 38 bytes (unused/spare)
The application code must handle:
- Opening/closing files with appropriate permissions
- Positioning file pointers to read/write specific records
- Parsing raw bytes into application data structures
- Managing concurrent access (typically using file locks)
- Ensuring data integrity across related files
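To make that bookkeeping concrete, here is a minimal Python sketch that parses one 100-byte record of the layout above. The ASCII text fields and digit-encoded salary are simplifying assumptions for illustration; real mainframe files often used EBCDIC and true packed decimal.

```python
# Parse one fixed-length employee record (100 bytes) into a dict.
# Simplifying assumptions: ASCII text fields, salary stored as nine
# plain digits (real systems often used EBCDIC and packed decimal).

RECORD_LEN = 100

def parse_employee(record: bytes) -> dict:
    if len(record) != RECORD_LEN:
        raise ValueError("record must be exactly 100 bytes")
    return {
        "employee_id": int(record[0:6]),
        "last_name": record[6:26].decode("ascii").rstrip(),
        "first_name": record[26:41].decode("ascii").rstrip(),
        "department": int(record[41:45]),
        "salary": int(record[45:54]),
        "hire_date": record[54:62].decode("ascii"),  # YYYYMMDD
        # bytes 62-99 are the unused/spare region
    }

# Build a sample record with the same layout, then parse it.
raw = (b"000042"
       + b"Smith".ljust(20)
       + b"Alice".ljust(15)
       + b"0001"
       + b"000085000"
       + b"20230115"
       + b" " * 38)
emp = parse_employee(raw)
```

Every application touching this file must duplicate this layout knowledge; if the record format changes, each copy of this parsing code must be found and updated, which is exactly the coupling a DBMS catalog removes.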
Problems with File Systems:
- Data Redundancy and Inconsistency: Multiple applications maintain duplicate data. A customer's address might appear in sales, support, and billing files. When the customer moves, updating all copies is error-prone, leading to inconsistent data across the organization.
- Data Isolation: Data exists in different formats and locations. Marketing's customer file might use comma-separated values, while sales uses fixed-width format. Creating integrated reports requires complex, custom programming to reconcile these differences.
- Integrity Problems: File systems don't enforce business rules. Nothing prevents a payroll application from entering negative salary values or assigning employees to non-existent departments. Each application must implement its own validation logic, often inconsistently.
- Atomicity Issues: If a funds transfer application crashes after debiting one account but before crediting another, the system becomes inconsistent. File systems provide no built-in mechanism to ensure that operations complete entirely or not at all.
- Concurrency Control: When multiple users simultaneously access files, problems arise. Consider two travel agents booking the last seat on a flight. Without proper coordination, both might sell the same seat, leading to overbooking.
- Security Limitations: File systems provide coarse access control (read/write/execute at file level). Granting a user access to salary information typically means granting access to the entire payroll file, exposing all employees' salaries.
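The overbooking scenario can be reproduced deterministically. This toy Python sketch (the flight number and agents are invented) interleaves two read-then-write sequences and loses one of the updates:

```python
# Lost-update anomaly: two agents both book "the last seat" because
# each reads the seat count before either writes its update back.

seats = {"FL123": 1}   # one seat left on (hypothetical) flight FL123
bookings = []

# Both agents read the current count first (interleaved reads)...
agent_a_view = seats["FL123"]   # sees 1
agent_b_view = seats["FL123"]   # also sees 1

# ...then each decides a seat is available and writes independently.
if agent_a_view > 0:
    seats["FL123"] = agent_a_view - 1
    bookings.append("agent A")
if agent_b_view > 0:
    seats["FL123"] = agent_b_view - 1   # silently overwrites A's update
    bookings.append("agent B")

# Result: two tickets sold, yet the stored count only dropped by one.
```

A DBMS prevents this by making the read and the write a single atomic unit, for example through locking or an atomic conditional UPDATE.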
Database System Advantages:
- Self-Describing Nature: A DBMS contains not just data but metadata—the database catalog or dictionary stores information about data structures, constraints, and relationships. Applications query this catalog to understand data formats, enabling greater flexibility and self-documentation.
- Data Abstraction: The DBMS provides multiple views of data. Physical storage details (file locations, block sizes, index structures) are hidden from users. The conceptual schema describes the overall logical structure, while external schemas provide customized user views. A personnel manager might see employee names and departments without salary information.
- Controlled Redundancy: While some redundancy might exist for performance, the DBMS manages it. Through normalization theory and integrity constraints, the system minimizes redundancy while maintaining consistency. When redundancy is intentionally introduced (like denormalization for performance), the DBMS helps maintain synchronization.
- Data Independence: Applications become immune to changes in storage structures and access methods. When the DBA adds indexes for performance, existing applications continue working unchanged. This logical and physical data independence revolutionized application development.
- Efficient Data Access: The DBMS employs sophisticated optimization techniques. A query like "find all employees in department 42 earning more than $100,000" might be executed using indexes, hash lookups, or full table scans—the DBMS chooses the most efficient strategy based on data statistics.
- Transaction Support: ACID properties (Atomicity, Consistency, Isolation, Durability) ensure reliable data processing. The DBMS guarantees that transactions complete fully, maintain consistency, don't interfere with each other, and survive system failures.
- Concurrent Access: Sophisticated locking and multiversion concurrency control allow thousands of users to simultaneously access and modify data while maintaining consistency. The DBMS automatically manages lock granularity (row-level, page-level, table-level) based on workload characteristics.
- Security and Authorization: Fine-grained access control can restrict users to specific rows and columns. A manager might view their direct reports' salaries but not other departments'. The DBMS enforces these restrictions regardless of which application accesses the data.
- Backup and Recovery: Automated backup procedures and recovery mechanisms protect against data loss. Point-in-time recovery can restore a database to its state just before an erroneous UPDATE statement. Log-based recovery ensures no committed transactions are lost even after system crashes.
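The optimizer behavior described under "Efficient Data Access" can be observed directly. The sketch below uses SQLite (table and index names are invented) and its EXPLAIN QUERY PLAN output; other systems expose similar EXPLAIN facilities.

```python
# Observe the optimizer choosing an access path: SQLite reports a full
# scan before the index exists and an index search afterwards.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, "
             "department INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO employees (department, salary) VALUES (?, ?)",
                 [(d, 50000 + d * 1000) for d in range(100)])

query = "SELECT id FROM employees WHERE department = 42 AND salary > 100000"

def plan(conn, sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " ".join(row[3] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(conn, query)                  # full table scan
conn.execute("CREATE INDEX idx_dept ON employees (department)")
after = plan(conn, query)                   # search via idx_dept
```

The application's SQL text never changes; only the plan the DBMS chooses does.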
Modern database systems exhibit several defining characteristics that distinguish them from simpler data storage mechanisms.
1. Persistence Data outlives the programs that create or access it. When an application terminates, the data remains safely stored on secondary storage. Even system failures don't compromise data durability—committed transactions survive crashes and restarts.
2. Data Integration The database serves as a central repository for multiple applications and users. Sales, inventory, accounting, and human resources all access the same integrated data store, eliminating artificial barriers between organizational functions.
3. Data Sharing Multiple users and applications simultaneously access the database. The DBMS manages this sharing, ensuring that concurrent operations don't interfere with each other. A customer service representative in New York and a warehouse clerk in London can update the same order simultaneously without conflict.
4. Controlled Redundancy While some redundancy might exist, it's carefully managed. Unlike file systems where redundancy is accidental and uncontrolled, database redundancy is intentional and controlled through mechanisms like:
- Foreign keys that maintain referential integrity
- Materialized views that automatically refresh
- Replication that synchronizes copies
5. Data Consistency Integrity constraints ensure data accuracy and correctness at all times. Constraints include:
- Domain constraints (salary must be positive)
- Key constraints (employee IDs must be unique)
- Entity integrity (primary keys cannot be null)
- Referential integrity (foreign keys must reference existing records)
- Business rules (a manager's salary must exceed their subordinates' salaries)
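Most of these constraint types translate directly into SQL DDL. A sketch using SQLite syntax follows (table names are invented; complex business rules like the manager-salary rule usually need triggers rather than simple CHECK clauses):

```python
# Declarative constraints: the DBMS rejects bad data no matter which
# application submits it. SQLite shown; DDL syntax varies by vendor.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # FK enforcement is off by default
conn.executescript("""
    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,              -- entity integrity
        name    TEXT NOT NULL,
        salary  INTEGER CHECK (salary > 0),       -- domain constraint
        dept_id INTEGER REFERENCES departments    -- referential integrity
    );
    INSERT INTO departments VALUES (10, 'Engineering');
""")

errors = []
for stmt in [
    "INSERT INTO employees VALUES (1, 'Ada', -500, 10)",   # negative salary
    "INSERT INTO employees VALUES (2, 'Bob', 90000, 99)",  # no such dept
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as exc:
        errors.append(str(exc))

conn.execute("INSERT INTO employees VALUES (3, 'Eve', 80000, 10)")  # valid
```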
6. Query Capability Users can ask ad-hoc questions without programming. A marketing analyst can query "show me customers who purchased product X but not product Y in the last 30 days" without writing custom code. The DBMS's query processor optimizes and executes such requests efficiently.
7. Data Independence Applications remain unaffected by changes in data storage and access methods. When the DBA decides to partition a large table across multiple disks, existing queries continue working. When new indexes are added for performance, applications don't need modification.
8. Security Comprehensive security mechanisms protect data from unauthorized access. Features include:
- Authentication (verifying user identity)
- Authorization (controlling access to specific data)
- Encryption (protecting data in transit and at rest)
- Auditing (tracking who accessed what and when)
9. Transaction Management The DBMS ensures reliable processing of business operations through ACID properties:
Atomicity: Transactions are all-or-nothing. If a funds transfer consists of debiting account A and crediting account B, either both operations occur or neither does. Partial execution is impossible.
Consistency: Transactions transform the database from one consistent state to another. All defined integrity constraints hold before and after transaction execution.
Isolation: Concurrent transactions appear to execute sequentially. Even with hundreds of simultaneous users, each transaction sees a consistent database state as if it were the only active transaction.
Durability: Once committed, transaction effects persist even through system failures. If the power fails immediately after confirming a funds transfer, the transfer remains complete when the system restarts.
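Atomicity in particular is easy to demonstrate. In this SQLite sketch (account names are invented), a simulated crash between the debit and the credit rolls the whole transfer back:

```python
# Atomicity: if the transfer fails after the debit, the DBMS rolls the
# debit back too, leaving balances as if the transfer never started.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acct TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", 100), ("B", 50)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 60 "
                     "WHERE acct = 'A'")        # debit succeeds
        raise RuntimeError("simulated crash")   # crash before the credit
        # (the matching credit to account B would have run here)
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT acct, balance FROM accounts"))
```

The debit that had already executed is undone: both balances are exactly what they were before the transaction began.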
10. Scalability Database systems can grow with organizational needs. Vertical scalability adds more powerful hardware (more CPUs, memory, faster storage). Horizontal scalability distributes data across multiple servers (sharding, replication, partitioning).
Data models provide abstractions for describing data, relationships, and constraints. Different models emphasize different aspects of data management.
Hierarchical Model (1960s) IBM's IMS popularized the hierarchical model, organizing data as tree structures. Each record has a single parent (except root records) and can have multiple children. This 1:N relationship structure worked well for Bill of Materials (a product consists of subassemblies, which consist of parts) but struggled with many-to-many relationships.
Example: A university hierarchical database might structure data as:
- University (root)
- Colleges (children of University)
- Departments (children of Colleges)
- Faculty (children of Departments)
- Courses (children of Departments)
- Students (children of Courses)
Problems arise when a student takes multiple courses—the student would appear multiple times in the tree, introducing redundancy and update anomalies.
Network Model (1970s) The CODASYL (Conference on Data Systems Languages) network model extended the hierarchical approach to support many-to-many relationships using sets. Records could have multiple parents, creating a network structure. While more flexible than the hierarchical model, the network model required programmers to navigate complex pointer structures, making application development challenging.
Relational Model (1970s-Present) Codd's relational model organizes data into tables (relations) with rows (tuples) and columns (attributes). Relationships are represented through common values rather than explicit pointers. The model's mathematical foundation in set theory and predicate logic provides powerful query capabilities through relational algebra and calculus.
Advantages:
- Simple, intuitive representation
- Strong theoretical foundation
- Data independence through logical views
- Declarative query languages (SQL)
- Well-understood normalization theory
Entity-Relationship Model (1976) Peter Chen's ER model provides a conceptual design tool bridging the gap between real-world requirements and database implementation. Entities represent objects (Student, Course), attributes describe entity properties (student_name, course_credits), and relationships connect entities (Enrollment connects Students and Courses).
ER diagrams use standardized notation:
- Rectangles for entities
- Diamonds for relationships
- Ovals for attributes
- Lines with cardinality indicators (1:1, 1:N, M:N)
Object-Oriented Model (1980s-1990s) Object-oriented databases attempted to address impedance mismatch between OOP languages and relational databases. Objects encapsulate both data and behavior, supporting inheritance, polymorphism, and complex data types. While influential, OODBMS never achieved mainstream adoption outside specialized domains like CAD/CAM and telecommunications.
Object-Relational Model (1990s-Present) Modern relational databases incorporate object-oriented features: user-defined types, inheritance, methods, and complex data structures. PostgreSQL pioneered this approach, supporting array types, composite types, and custom functions. Oracle's object-relational features include object tables, REF types, and method definitions.
Semantic Models (1990s-Present) Semantic data models emphasize meaning and relationships. The Resource Description Framework (RDF) represents information as subject-predicate-object triples, enabling semantic web applications. Knowledge graphs and ontology-based models support AI and machine learning applications by capturing domain knowledge and inference rules.
NoSQL Models (2000s-Present) The NoSQL movement encompasses multiple data models:
Key-Value Stores: Simple, high-performance systems storing arbitrary data under unique keys. Redis and Amazon DynamoDB exemplify this model, supporting millions of operations per second.
Document Stores: Semi-structured data in JSON, BSON, or XML formats. MongoDB's document model allows nested structures and varying schemas within the same collection, supporting agile development and evolving requirements.
Column-Family Stores: Google BigTable and Apache Cassandra organize data by column families, optimizing for wide tables with many attributes. This model excels at time-series data, sensor readings, and analytics workloads.
Graph Databases: Neo4j and Amazon Neptune store nodes (entities) and edges (relationships), optimizing for connected data queries like social networks, recommendation engines, and fraud detection.
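To make the document model concrete, here is a toy sketch in plain Python (not a real MongoDB client): one collection holds documents with varying schemas, queried by field matching.

```python
# A toy document "collection": JSON-like records with differing schemas
# coexist in one collection and are queried by field matching.
customers = [
    {"_id": 1, "name": "Acme", "tier": "gold",
     "contacts": [{"email": "ops@acme.example"}]},
    {"_id": 2, "name": "Globex", "tier": "silver"},        # no contacts field
    {"_id": 3, "name": "Initech", "tier": "gold", "churned": True},
]

def find(collection, criteria):
    """Return documents whose fields match every key in `criteria`."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

gold = find(customers, {"tier": "gold"})
```

A relational design would force these records into one fixed schema (or several tables); the document model tolerates per-document variation at the cost of weaker schema enforcement.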
Understanding the various stakeholders in a database environment helps clarify requirements and responsibilities.
Database Administrators (DBAs) DBAs are the technical guardians of the database environment. Their responsibilities span:
Installation and Configuration: Setting up DBMS software, configuring memory parameters, storage allocation, and network settings. A DBA might configure Oracle's SGA (System Global Area) size based on available memory and workload characteristics.
Security Management: Creating user accounts, assigning roles, managing privileges, and implementing encryption. DBAs establish authentication mechanisms (passwords, LDAP, Kerberos) and authorization policies.
Backup and Recovery: Designing backup strategies (full, incremental, differential), scheduling backups, testing restore procedures, and recovering from failures. A banking DBA might implement continuous archiving to support point-in-time recovery.
Performance Monitoring: Tracking system performance, identifying bottlenecks, tuning queries, and optimizing indexes. DBAs use tools like Oracle Enterprise Manager, PostgreSQL pg_stat_statements, and custom monitoring scripts.
Capacity Planning: Forecasting growth, planning hardware upgrades, and ensuring adequate resources. DBAs analyze trends in data volume, transaction rates, and user concurrency.
Schema Management: Designing and maintaining database structures, implementing changes through controlled procedures, and ensuring referential integrity.
Database Designers Designers create the logical and physical database structure. Working with application developers and end users, they:
- Analyze requirements and create conceptual data models
- Transform conceptual models into normalized relational schemas
- Design indexes, partitioning strategies, and storage layouts
- Document design decisions and constraints
Application Developers Developers build applications that interact with databases. Their responsibilities include:
- Writing SQL queries and stored procedures
- Implementing data access layers in programming languages (Java, Python, C#)
- Managing connections and transaction boundaries
- Handling errors and exceptional conditions
- Optimizing application-database interaction
End Users End users consume database information through applications:
Naive Users: Use applications with pre-defined interfaces. An airline reservation agent enters passenger information without writing queries or understanding database structure.
Sophisticated Users: Query databases directly using SQL or reporting tools. Business analysts might write complex queries to analyze sales trends or customer behavior.
Specialized Users: Build database-backed applications. Data scientists extract training data for machine learning models. ETL developers design data integration pipelines.
System Analysts Analysts bridge business requirements and technical implementation. They:
- Gather and document user requirements
- Create use cases and business process models
- Validate database designs against business needs
- Ensure compliance with regulations and policies
Data Architects Architects focus on enterprise-wide data strategy:
- Establish data standards and best practices
- Design integration patterns across systems
- Plan data migration and consolidation strategies
- Govern data quality and master data management
The ANSI/SPARC (American National Standards Institute/Standards Planning and Requirements Committee) architecture, proposed in 1975, established a three-level framework for database systems that remains influential today.
Motivation for Three-Level Architecture Before ANSI/SPARC, database systems tightly coupled user views with physical storage. Any change to storage structures required modifying applications. The three-level architecture aimed to provide:
- Data Independence: Applications insulated from physical storage changes
- Multiple Views: Different users see customized data representations
- Centralized Control: DBMS manages all data access and integrity
The ANSI/SPARC architecture comprises three distinct levels:
External Level (User Views) The external level describes how individual users perceive the database. Each user group has its own external schema (subschema) showing only relevant data. Examples:
A payroll clerk's external view might include:
CREATE VIEW PayrollView AS
SELECT employee_id, employee_name, department, salary, bank_account
FROM employees
WHERE status = 'ACTIVE';

A project manager's external view might show:
CREATE VIEW ProjectAssignmentView AS
SELECT e.employee_name, p.project_name, a.hours_allocated,
a.start_date, a.end_date
FROM employees e, projects p, assignments a
WHERE e.employee_id = a.employee_id
AND p.project_id = a.project_id;

Benefits of external schemas:
- Security: Users see only authorized data
- Simplicity: Complex underlying structures hidden
- Customization: Data presented in most useful format
- Independence: Underlying changes don't affect views if structure preserved
Conceptual Level (Community View) The conceptual level describes the entire database structure without physical details. It represents:
- All entities, attributes, and relationships
- Integrity constraints and business rules
- Security and authorization requirements
- Data independence from storage considerations
A conceptual schema for a university might define:
CREATE TABLE students (
student_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
date_of_birth DATE,
email VARCHAR(200) UNIQUE,
enrollment_date DATE DEFAULT CURRENT_DATE
);
CREATE TABLE courses (
course_id CHAR(10) PRIMARY KEY,
title VARCHAR(200) NOT NULL,
credits INTEGER CHECK (credits BETWEEN 1 AND 6),
department VARCHAR(50)
);
CREATE TABLE enrollments (
student_id INTEGER REFERENCES students,
course_id CHAR(10) REFERENCES courses,
semester CHAR(6),
grade CHAR(2),
PRIMARY KEY (student_id, course_id, semester)
);

Internal Level (Physical Storage) The internal level describes how data is physically stored and accessed:
- File organizations (heap, sequential, hash)
- Index structures (B-trees, hash indexes)
- Storage allocations (tablespaces, data files)
- Compression and encryption methods
- Data placement across disks
Example internal specifications:
Table: students
Storage Organization: Heap file with clustering on student_id
Indexes:
- Primary B-tree index on student_id (clustered)
- Secondary hash index on email
Tablespace: USER_DATA01 on disk group DATA_DG1
Row Format: Variable length with compression
Blocksize: 8KB
Data independence—the immunity of applications to changes in data representation—is a fundamental benefit of the three-level architecture.
Physical Data Independence Applications and conceptual schemas remain unchanged when physical storage changes. Examples of physical changes that shouldn't affect applications:
- Index Changes: Adding or dropping indexes to improve performance
- File Organization: Switching from heap to hash organization
- Storage Devices: Moving from HDD to SSD storage
- Partitioning: Splitting a table across multiple disks
- Compression: Implementing or changing compression algorithms
- Reorganization: Rebuilding tables to eliminate fragmentation
Logical Data Independence External schemas (user views) remain unchanged when conceptual schema changes. Examples of logical changes that shouldn't affect external views:
- Adding New Attributes: Adding columns to tables
- Creating New Tables: Introducing new entity types
- Splitting Tables: Decomposing tables for normalization
- Merging Tables: Combining tables for performance
- Adding Relationships: Introducing new associations
Achieving logical data independence is more challenging than physical independence. View mechanisms must shield users from underlying changes through view definitions that map external to conceptual structures.
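A concrete illustration of that shielding, using SQLite (table and view names are invented): the employees table is split for normalization, and redefining one view keeps the application's query working unchanged.

```python
# Logical data independence: after the employees table is split in two,
# recreating the view lets the existing application query run unchanged.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY,
                            name TEXT, salary INTEGER);
    INSERT INTO employees VALUES (1, 'Ada', 120000), (2, 'Bob', 90000);
    CREATE VIEW staff AS SELECT emp_id, name, salary FROM employees;
""")
app_query = "SELECT name, salary FROM staff ORDER BY emp_id"
before = conn.execute(app_query).fetchall()

# Conceptual-schema change: move salaries to their own table, rebuild
# employees without the column, then redefine the view (the mapping).
conn.executescript("""
    CREATE TABLE salaries (emp_id INTEGER PRIMARY KEY, salary INTEGER);
    INSERT INTO salaries SELECT emp_id, salary FROM employees;
    CREATE TABLE employees_new (emp_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO employees_new SELECT emp_id, name FROM employees;
    DROP VIEW staff;
    DROP TABLE employees;
    ALTER TABLE employees_new RENAME TO employees;
    CREATE VIEW staff AS
        SELECT e.emp_id, e.name, s.salary
        FROM employees e JOIN salaries s ON e.emp_id = s.emp_id;
""")
after = conn.execute(app_query).fetchall()   # same query, same result
```

The application never learns that `staff` is now a join; only the external/conceptual mapping (the view definition) changed.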
Mappings Between Levels The DBMS maintains mappings connecting the three levels:
External/Conceptual Mapping: Translates user queries on external views to operations on conceptual schema. When a payroll clerk queries the PayrollView, the DBMS maps this to appropriate conceptual-level tables.
Conceptual/Internal Mapping: Translates conceptual operations to physical storage access. When the conceptual layer requests employee records, the internal mapping determines which files, blocks, and indexes to access.
These mappings enable data independence—changes at one level require only mapping adjustments, not application modifications.
A complete database system comprises multiple interacting components:
Hardware Components:
- Processors: CPUs executing DBMS code and application queries
- Memory: RAM for buffer caches, sort areas, and program execution
- Storage: Persistent storage (HDD, SSD) for data files, logs, and backups
- Network: Interconnects allowing client-server communication
Software Components:
DBMS Software:
- Query Processor: Parses, validates, optimizes, and executes queries
- Storage Manager: Manages data files, buffer pools, and disk I/O
- Transaction Manager: Coordinates concurrent access and recovery
- Lock Manager: Implements concurrency control protocols
- Log Manager: Records changes for recovery purposes
- Catalog Manager: Maintains metadata about database objects
Supporting Software:
- Operating System: Provides process scheduling, file management, networking
- Network Software: Enables client-server communication
- Application Programs: Provide user interfaces and business logic
People Components:
- Database administrators
- Application developers
- End users
- System support staff
Procedures:
- Startup and shutdown procedures
- Backup and recovery processes
- Security audit protocols
- Performance monitoring routines
- Change management processes
Oracle Database
Oracle, first released in 1979, remains the market leader in enterprise database systems. Key features:
Multitenant Architecture: Pluggable databases allow consolidation of multiple databases into a single container database, simplifying management and resource allocation.
Real Application Clusters (RAC): Multiple instances access a single database, providing high availability and scalability. If one node fails, others continue processing.
Automatic Storage Management (ASM): Volume manager and file system optimized for database files, providing striping, mirroring, and online reconfiguration.
Advanced Security: Transparent Data Encryption (TDE), Data Redaction, and Database Vault protect sensitive information.
Oracle Exadata: Database machine integrating hardware and software for extreme performance. Smart Scan offloads query processing to storage servers.
MySQL
MySQL, acquired by Sun Microsystems in 2008 and subsequently by Oracle in 2010, dominates web applications. Features:
Storage Engine Architecture: Pluggable storage engines (InnoDB, MyISAM, Memory) allow workload-specific optimization. InnoDB provides ACID compliance with row-level locking; MyISAM offers table-level locking with full-text search.
Replication: Master-slave replication supports read scaling and high availability. Group replication provides multi-master capabilities.
Partitioning: Range, list, hash, and key partitioning distribute data for manageability and performance.
JSON Support: Native JSON data type with efficient storage and indexing of JSON documents.
PostgreSQL
PostgreSQL, the "world's most advanced open source database," originated from UC Berkeley's POSTGRES project. Features:
Extensibility: Users can define data types, operators, index methods, and functions. PostGIS extends PostgreSQL with geospatial capabilities.
MVCC Implementation: Multiversion concurrency control provides snapshot isolation without read locks.
Advanced Indexing: B-tree, Hash, GiST, SP-GiST, GIN, and BRIN indexes support diverse query patterns.
Table Partitioning: Declarative partitioning simplifies management of large tables.
Logical Replication: Publish-subscribe replication enables selective data distribution and upgrades with minimal downtime.
Microsoft SQL Server
SQL Server, first released for OS/2 in 1989, evolved into a comprehensive data platform. Features:
Integration with Microsoft Ecosystem: Tight integration with Visual Studio, .NET, Azure, and Power BI.
Columnstore Indexes: Optimize analytical workloads by storing data column-wise rather than row-wise.
In-Memory OLTP: Memory-optimized tables and natively compiled stored procedures accelerate transaction processing.
Always On Availability Groups: Provide high availability and disaster recovery with readable secondaries.
Intelligent Query Processing: Adaptive joins, batch mode, and interleaved execution optimize query performance.
Data models are abstractions that describe how data is represented, organized, and manipulated. They provide the conceptual tools for database design and implementation.
Purpose of Data Models:
- Communication: Data models facilitate communication between stakeholders—business users, analysts, and technical implementers. An entity-relationship diagram conveys meaning without requiring technical database knowledge.
- Precision: Formal modeling notation ensures precise specification of requirements, reducing ambiguity and misinterpretation.
- Documentation: Models serve as living documentation of the database structure, supporting maintenance, evolution, and knowledge transfer.
- Design Validation: Before investing in implementation, models allow validation of requirements, identification of missing elements, and discovery of inconsistencies.
- Implementation Blueprint: Models guide physical database creation, providing specifications for tables, constraints, and indexes.
Levels of Data Models:
Conceptual Data Models: High-level representations focusing on business concepts and relationships, independent of implementation details. Used primarily for requirements analysis and communication with business stakeholders.
Logical Data Models: More detailed representations specifying data structures, attributes, keys, and constraints, but still independent of specific DBMS products. Represent what the system should contain.
Physical Data Models: Implementation-specific models describing storage structures, indexes, partitioning, and performance optimizations for a particular DBMS. Represent how data will be stored and accessed.
Object-based logical models describe data at the conceptual and logical levels using concepts close to how humans perceive reality.
Entity-Relationship Model: The most widely used object-based model, describing the world in terms of entities (things) and relationships (associations).
Object-Oriented Model: Extends object-oriented programming concepts to databases. Objects encapsulate state (attributes) and behavior (methods). Classes define object types, supporting inheritance hierarchies.
Semantic Models: Emphasize the meaning of data and relationships. The NIAM (Nijssen's Information Analysis Methodology) and ORM (Object-Role Modeling) approaches focus on facts and roles rather than entities and attributes.
Record-based models structure data as fixed-format records, forming the basis for most implemented database systems.
Relational Model: Data organized as tables (relations) with rows (tuples) and columns (attributes). Relationships represented through common values rather than explicit pointers.
Network Model: Data organized as record types connected in a network structure using sets. Each record can participate in multiple set relationships, allowing many-to-many associations.
Hierarchical Model: Data organized as tree structures with parent-child relationships. Each child has exactly one parent, limiting flexibility but providing efficient access for hierarchical data.
Physical data models describe how data is actually stored and accessed on storage media.
Storage Structures:
- Heap files: Records stored without any particular order
- Sequential files: Records stored in key sequence
- Hashing: Records placed in buckets based on hash function
- Clustering: Related records stored together physically
Access Methods:
- B-Trees: Balanced tree structures supporting efficient search, insert, delete
- Hash Indexes: Direct access based on hash function
- Bitmap Indexes: Bit arrays for low-cardinality columns
- R-Trees: Spatial access methods for multidimensional data
Storage Allocation:
- Extents: Contiguous blocks allocated to database objects
- Segments: Logical storage units containing all extents for an object
- Tablespaces: Logical containers grouping related segments
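The hashing approach described above can be sketched in a few lines. This is a toy illustration with a hypothetical bucket count and records, not a real storage manager: the point is that a lookup touches only the single bucket the key hashes to.

```python
# Toy sketch of hash-based record placement (illustrative only):
# records are assigned to one of N buckets by hashing the key.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_for(key):
    # A real DBMS uses a hash function chosen for uniform spread.
    return hash(key) % NUM_BUCKETS

def insert(key, record):
    buckets[bucket_for(key)].append((key, record))

def lookup(key):
    # Only one bucket is scanned, regardless of total record count.
    return [rec for k, rec in buckets[bucket_for(key)] if k == key]

insert(101, "Alice")
insert(102, "Bob")
print(lookup(101))  # ['Alice']
```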
Conceptual Design Phase:
Activities:
- Identify entities and relationships
- Determine business rules and constraints
- Create ER diagrams
- Validate with business stakeholders
Deliverable: Conceptual data model (ER diagram, business rules documentation)
Example for university database:
- Entities: Student, Professor, Course, Department, Classroom
- Relationships: Enrolls (Student-Course), Teaches (Professor-Course), Belongs (Student-Department)
- Business rules: A student must declare a major department; A course can have multiple sections
Logical Design Phase:
Activities:
- Transform conceptual model to relational schema
- Normalize relations (1NF, 2NF, 3NF, BCNF)
- Define primary keys, foreign keys, and constraints
- Specify views and access patterns
Deliverable: Relational schema with normalized tables, constraints, and views
Example transformations:
- Many-to-many relationship (Student-Course) becomes separate Enrollment table
- Multivalued attributes become separate tables
- Inheritance hierarchies mapped using various strategies
Physical Design Phase:
Activities:
- Choose storage structures and file organizations
- Design indexes based on query patterns
- Determine partitioning and clustering strategies
- Estimate storage requirements and plan capacity
Deliverable: Physical database design with DDL scripts, indexing strategy, storage configuration
Example physical decisions:
- Create clustered index on student_id for Student table
- Partition Enrollment table by semester for better manageability
- Add bitmap indexes on low-cardinality columns (gender, status)
Abstraction and generalization are fundamental cognitive tools for managing complexity in data modeling.
Abstraction focuses on essential characteristics while ignoring irrelevant details. In data modeling, abstraction allows us to represent real-world phenomena at appropriate levels of detail.
Types of abstraction:
Classification: Grouping objects with common properties into classes. The individual student "John Smith" is an instance of the class "Student."
Aggregation: Treating a collection of related objects as a higher-level object. A "Course" aggregates multiple "Enrollment" instances and a "Professor."
Generalization: Extracting common properties from multiple classes into a superclass. "Professor" and "Student" generalize to "Person."
Specialization: Defining subclasses with additional specific properties. "Person" specializes to "Employee" and "Student."
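The generalization/specialization abstractions above map naturally onto class inheritance. The following is an illustrative sketch (the class and attribute names are hypothetical, not a prescribed design):

```python
from dataclasses import dataclass

# Generalization: Student and Professor share Person's common
# attributes through a superclass.
@dataclass
class Person:
    person_id: int
    name: str

@dataclass
class Student(Person):      # specialization adds student-specific state
    major: str

@dataclass
class Professor(Person):    # specialization adds professor-specific state
    rank: str

s = Student(101, "John Smith", "CS")
assert isinstance(s, Person)   # classification: every Student is a Person
```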
Peter Chen's Entity-Relationship model, introduced in 1976, remains the most influential conceptual modeling approach.
Entities and Entity Sets:
An entity is a distinct object in the real world distinguishable from other objects. An entity set is a collection of similar entities.
Examples:
- Entity: Student with ID 12345
- Entity set: All students enrolled at the university
Strong vs Weak Entities:
Strong entities exist independently. Weak entities depend on strong entities for identification.
Example: A "Room" might be a weak entity dependent on "Building"—room numbers might repeat across buildings, requiring building identification for uniqueness.
Attributes:
Attributes describe properties of entities or relationships:
- Simple vs Composite: Simple attributes are atomic (age); composite attributes have components (address: street, city, zip)
- Single-valued vs Multivalued: Single-valued attributes have one value (birth_date); multivalued can have multiple (phone_numbers)
- Stored vs Derived: Stored attributes persist in the database (birth_date); derived attributes are computed (age from birth_date and the current date)
- Null-valued: Attributes that may not have values for all entities
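The stored-vs-derived distinction can be illustrated with a short computation of the derived attribute age from a stored birth_date. This is a sketch, not tied to any particular DBMS:

```python
from datetime import date

# Derived attribute: age is computed from the stored birth_date
# and the current date, so it is never stored itself.
def age(birth_date, today):
    years = today.year - birth_date.year
    # subtract one if the birthday hasn't occurred yet this year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

assert age(date(2000, 6, 15), date(2024, 6, 14)) == 23
assert age(date(2000, 6, 15), date(2024, 6, 15)) == 24
```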
Keys:
- Superkey: Set of attributes uniquely identifying entities
- Candidate key: Minimal superkey (no proper subset is a superkey)
- Primary key: Selected candidate key for entity identification
- Foreign key: Attribute(s) referencing primary key of another entity
Relationships and Relationship Sets:
A relationship is an association among entities. A relationship set is a collection of similar relationships.
Degree of Relationship:
- Unary (recursive): Relationship involving same entity set (e.g., "manages" between Employee and Employee)
- Binary: Relationship between two entity sets (most common)
- Ternary: Relationship among three entity sets (e.g., "prescribes" among Doctor, Patient, Medication)
Cardinality Constraints:
Specify how many instances of one entity can relate to another:
- One-to-One (1:1): Each entity in set A relates to at most one in set B
- One-to-Many (1:N): Each entity in A relates to many in B; each in B relates to at most one in A
- Many-to-Many (M:N): Entities in both sets can relate to many in the other
Participation Constraints:
- Total participation (existence dependency): Every entity in set must participate in relationship
- Partial participation: Entities may or may not participate
Example: Every student must have an advisor (total participation); not every professor advises students (partial participation)
ER Diagram Notation:
Standard Chen notation:
- Rectangles: Entity sets
- Diamonds: Relationship sets
- Ovals: Attributes
- Underlined attributes: Primary keys
- Double ovals: Multivalued attributes
- Dashed ovals: Derived attributes
- Double rectangles: Weak entity sets
- Double diamonds: Identifying relationships
- Lines connecting components with cardinality indicators
The Extended Entity-Relationship model adds semantic concepts to handle more complex modeling situations.
Specialization and Generalization:
Specialization is the process of defining subclasses of an entity type. A "Student" entity might specialize to "GraduateStudent" and "UndergraduateStudent."
Generalization is the reverse process—identifying common features among entity types to define a superclass. "Professor" and "Student" generalize to "Person."
Constraints on Specialization/Generalization:
- Disjointness constraint: Subclasses may be disjoint (an entity belongs to at most one) or overlapping (can belong to multiple)
- Completeness constraint: Total specialization means every superclass entity must belong to some subclass; partial allows entities without subclass membership
Union Types (Categories):
A union type represents a superclass formed from multiple entity types with different keys. Example: A "VehicleOwner" might be a Person or a Company.
Aggregation:
Treating a relationship as an entity for participation in other relationships. If "ProjectAssignment" (relationship between Employee and Project) needs to relate to "Budget," we aggregate the relationship into an entity.
Unified Modeling Language (UML), primarily designed for software engineering, provides class diagrams suitable for data modeling.
Class Notation:
- Classes: Rectangles with name, attributes, operations
- Attributes: Shown with visibility, type, multiplicity
- Associations: Lines between classes with role names, multiplicity
Advantages of UML:
- Integration with object-oriented design
- Support for operations (methods) not just data
- Rich constraint specification (OCL)
- Widely understood by software developers
Requirements Analysis:
A university needs a database to manage:
- Student information (personal details, enrollment, academic records)
- Course offerings (catalog, sections, schedules)
- Faculty (instructors, advisors, department affiliations)
- Registration (enrollments, grades, transcripts)
- Facilities (classrooms, buildings, equipment)
Conceptual Design:
Identify entities and relationships:
Entities:
- Person (abstract superclass)
- Student (subclass)
- Faculty (subclass)
- Staff (subclass)
- Department
- Course
- CourseSection
- Classroom
- Building
Relationships:
- Person works_for Department
- Student majors_in Department
- Faculty belongs_to Department
- Course offered_by Department
- Faculty teaches CourseSection
- Student enrolls_in CourseSection (with Grade)
- CourseSection meets_in Classroom
- Classroom located_in Building
Attributes:
Person: person_id (PK), name, birth_date, address, phone, email
Student: student_id (PK), classification, enrollment_date, GPA (derived)
Faculty: faculty_id (PK), rank, hire_date, office_location
Department: dept_code (PK), name, office_phone, budget
Course: course_id (PK), title, credits, description
CourseSection: section_id (PK), semester, year, capacity, schedule
Classroom: room_id (PK), room_number, capacity, type
Building: building_id (PK), name, address, floors
Cardinality Constraints:
- Student majors_in Department (N:1 - many students in one department)
- Faculty belongs_to Department (N:1)
- Department employs Faculty (1:N)
- Faculty teaches CourseSection (1:N - assuming one instructor per section)
- Student enrolls_in CourseSection (M:N - many students, many sections)
- CourseSection meets_in Classroom (N:1 - many sections in one classroom)
EER Extensions:
Generalization: Person is superclass of Student, Faculty, Staff (overlapping, total)
Logical Design (Relational Schema):
Transform to relational model:
CREATE TABLE persons (
person_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
birth_date DATE,
address TEXT,
phone VARCHAR(20),
email VARCHAR(100) UNIQUE,
person_type CHAR(1) CHECK (person_type IN ('S', 'F', 'T')) -- discriminator
);
CREATE TABLE students (
student_id INTEGER PRIMARY KEY REFERENCES persons(person_id),
classification VARCHAR(20),
enrollment_date DATE NOT NULL
);
CREATE TABLE faculty (
faculty_id INTEGER PRIMARY KEY REFERENCES persons(person_id),
rank VARCHAR(30),
hire_date DATE NOT NULL,
office_location VARCHAR(50)
);
CREATE TABLE departments (
dept_code CHAR(4) PRIMARY KEY,
name VARCHAR(100) NOT NULL,
office_phone VARCHAR(20),
budget DECIMAL(10,2)
);
CREATE TABLE courses (
course_id CHAR(8) PRIMARY KEY,
title VARCHAR(200) NOT NULL,
credits INTEGER CHECK (credits BETWEEN 1 AND 6),
description TEXT,
dept_code CHAR(4) REFERENCES departments
);
CREATE TABLE course_sections (
section_id INTEGER PRIMARY KEY,
course_id CHAR(8) REFERENCES courses,
semester CHAR(6) NOT NULL, -- e.g., '202401' for Spring 2024
year INTEGER NOT NULL,
capacity INTEGER CHECK (capacity > 0),
schedule VARCHAR(50), -- e.g., 'MWF 10:00-10:50'
faculty_id INTEGER REFERENCES faculty,
classroom_id INTEGER REFERENCES classrooms
);
CREATE TABLE enrollments (
student_id INTEGER REFERENCES students,
section_id INTEGER REFERENCES course_sections,
grade CHAR(2),
enrollment_date DATE DEFAULT CURRENT_DATE,
PRIMARY KEY (student_id, section_id)
);
CREATE TABLE buildings (
building_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
address TEXT,
floors INTEGER
);
CREATE TABLE classrooms (
classroom_id INTEGER PRIMARY KEY,
room_number VARCHAR(10) NOT NULL,
capacity INTEGER,
room_type VARCHAR(30),
building_id INTEGER REFERENCES buildings
);
-- Additional relationships
CREATE TABLE majors (
student_id INTEGER REFERENCES students,
dept_code CHAR(4) REFERENCES departments,
declared_date DATE,
PRIMARY KEY (student_id, dept_code)
);
CREATE TABLE department_employees (
person_id INTEGER REFERENCES persons,
dept_code CHAR(4) REFERENCES departments,
start_date DATE,
end_date DATE,
role VARCHAR(50),
PRIMARY KEY (person_id, dept_code, start_date)
);
Physical Design:
Consider indexing and performance:
-- Clustered index on primary keys (automatically in many DBMS)
-- Additional indexes for foreign keys
CREATE INDEX idx_enrollments_student ON enrollments(student_id);
CREATE INDEX idx_enrollments_section ON enrollments(section_id);
CREATE INDEX idx_sections_course ON course_sections(course_id);
CREATE INDEX idx_sections_faculty ON course_sections(faculty_id);
CREATE INDEX idx_sections_classroom ON course_sections(classroom_id);
-- Index for search by name
CREATE INDEX idx_persons_name ON persons(name);
-- Composite index for semester lookups
CREATE INDEX idx_sections_semester ON course_sections(semester, year);
-- Partitioning strategy
-- Partition enrollments by year (assuming semester contains year)
-- Partition course_sections by semester
-- Tablespace allocation
-- Active data on fast SSD storage
-- Historical data on slower HDD storage
This design provides a foundation for a complete university information system, supporting registration, grade recording, scheduling, and reporting functions.
The relational model, proposed by E.F. Codd in 1970, organizes data into relations (tables) with a rigorous mathematical foundation. This structure provides both simplicity for users and formal precision for system implementers.
Mathematical Foundation:
A relation is defined as a subset of the Cartesian product of a list of domains. Formally, if D₁, D₂, ..., Dₙ are domains (sets of values), then a relation R is a set of n-tuples (d₁, d₂, ..., dₙ) such that d₁ ∈ D₁, d₂ ∈ D₂, ..., dₙ ∈ Dₙ.
This mathematical definition ensures:
- Each relation has a fixed arity (number of attributes)
- Tuples are unique (no duplicate rows)
- Attribute values come from specific domains
Relation Schema vs Relation Instance:
A relation schema describes the structure: relation name, attribute names, and domains. Example:
Student(sid: integer, sname: string, major: string, credits: integer)
A relation instance is the actual data at a particular moment—a set of tuples conforming to the schema. Instances change over time as data is inserted, updated, or deleted.
Properties of Relations:
- Order Irrelevance: The ordering of tuples is immaterial. A relation is a set, not a list.
- No Duplicate Tuples: As a set, a relation cannot contain identical tuples.
- Attribute Atomicity: Attribute values must be atomic (indivisible) with respect to the relational model. This is the first normal form condition.
- Tuple Uniqueness: Each tuple must be uniquely identifiable, typically through keys.
Understanding the precise terminology of the relational model is essential.
Relation (Table): A relation is a named, two-dimensional table representing an entity set or relationship set. The term "table" is commonly used, though strictly speaking, SQL tables may have ordered rows and duplicate rows, violating the pure relational model.
Tuple (Row): A tuple is an ordered set of attribute values, representing one entity or relationship instance. The order of values corresponds to the attribute order in the relation schema.
Attribute (Column): An attribute is a named column of a relation, representing a property of the entity or relationship. Each attribute has a domain specifying permissible values.
Degree (Arity): The number of attributes in a relation. A relation of degree 1 is unary, degree 2 binary, degree 3 ternary, and so on.
Cardinality: The number of tuples in a relation instance. Cardinality changes as data is inserted or deleted.
Example:
Relation: Student (degree = 4)
Attributes: sid (integer), sname (string), major (string), credits (integer)
Instance (cardinality = 3):
sid | sname | major            | credits
----|-------|------------------|--------
101 | Alice | Computer Science | 45
102 | Bob   | Mathematics      | 30
103 | Carol | Physics          | 52
Domains:
A domain is a set of atomic values of a particular type. Domains specify the permissible values for attributes and provide the first level of data integrity.
Examples of domains:
- Integer: all integers within specified range
- VARCHAR(20): character strings up to 20 characters
- DATE: valid calendar dates
- Boolean: TRUE or FALSE
- ENUM('Freshman', 'Sophomore', 'Junior', 'Senior'): enumerated values
Domains can have associated constraints:
- NOT NULL: values must be provided
- CHECK: additional restrictions (e.g., age >= 0)
- DEFAULT: value used if none specified
Integrity Constraints:
Integrity constraints are rules that must hold for all instances of the database. They ensure data correctness and consistency.
Domain Constraints: Ensure attribute values come from proper domain.
Key Constraints: Ensure tuples are uniquely identifiable.
Entity Integrity: Primary key attributes cannot be null.
Referential Integrity: Foreign key values must match existing primary key values or be null.
General Constraints: Arbitrary business rules (e.g., salary must not exceed manager's salary).
Keys are fundamental to the relational model, providing tuple identification and relationship representation.
Superkey: A set of attributes that uniquely identifies tuples in a relation. Any superset of a superkey is also a superkey.
Example: In Student relation, {sid} is a superkey. So is {sid, sname}—though unnecessarily large.
Candidate Key: A minimal superkey—no proper subset is a superkey. A relation may have multiple candidate keys.
Example: In Student relation, if email is also unique, both {sid} and {email} are candidate keys.
Primary Key: The candidate key chosen for tuple identification. Primary keys:
- Cannot contain null values
- Should be stable (rarely change)
- Should be minimal (single attribute when possible)
- Are often indexed for fast access
Foreign Key: An attribute (or set of attributes) in one relation that references the primary key of another relation. Foreign keys represent relationships between relations.
Example:
Enrollment(sid, cid, grade)
-- sid is foreign key referencing Student(sid)
-- cid is foreign key referencing Course(cid)
Alternate Key: Candidate keys not chosen as primary key.
Composite Key: A key consisting of multiple attributes.
Natural Key vs Surrogate Key:
- Natural key: Uses existing attributes (e.g., Social Security Number)
- Surrogate key: Artificially generated (e.g., auto-increment integer)
Surrogate keys are often preferred because they are stable, simple, and independent of business changes.
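A quick SQLite sketch of a surrogate key alongside a natural candidate key (the schema is illustrative): the generated sid values are stable and independent of the business attribute email.

```python
import sqlite3

# In SQLite, INTEGER PRIMARY KEY auto-generates surrogate values
# when the column is omitted from an INSERT.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    sid   INTEGER PRIMARY KEY,    -- surrogate key (system-generated)
    email TEXT UNIQUE NOT NULL    -- natural candidate key
)""")
conn.execute("INSERT INTO students (email) VALUES ('alice@example.edu')")
conn.execute("INSERT INTO students (email) VALUES ('bob@example.edu')")

ids = [r[0] for r in conn.execute("SELECT sid FROM students ORDER BY sid")]
print(ids)  # [1, 2]
```

If a student later changes email address, the surrogate sid, and every foreign key referencing it, stays unchanged.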
Entity Integrity:
The entity integrity rule states that no primary key attribute can accept null values. This ensures every tuple is uniquely identifiable.
Example violation:
INSERT INTO Student(sid, sname, major, credits)
VALUES (NULL, 'David', 'History', 24); -- Invalid: primary key is null
Referential Integrity:
The referential integrity rule ensures that foreign key values either:
- Match a primary key value in the referenced relation, or
- Are completely null (if the foreign key allows nulls)
Example violation:
INSERT INTO Enrollment(sid, cid, grade)
VALUES (999, 'CS101', 'A'); -- Invalid if no student with sid=999 exists
Referential Integrity Actions:
When referenced data changes, referential integrity defines how to respond:
- ON DELETE CASCADE: Delete referencing tuples when the referenced tuple is deleted
- ON DELETE SET NULL: Set the foreign key to null when the referenced tuple is deleted
- ON DELETE SET DEFAULT: Set the foreign key to its default value
- ON DELETE RESTRICT/NO ACTION: Prevent deletion while references exist
- ON UPDATE CASCADE: Propagate key changes to foreign keys
- ON UPDATE SET NULL/SET DEFAULT: Analogous to the delete options
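The CASCADE option can be demonstrated in SQLite with illustrative tables shaped like the chapter's Student/Enrollment example. Note that SQLite enforces foreign keys only when the pragma is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite: FK enforcement is opt-in
conn.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE enrollment (
    sid INTEGER REFERENCES student(sid) ON DELETE CASCADE,
    cid TEXT,
    PRIMARY KEY (sid, cid)
)""")
conn.execute("INSERT INTO student VALUES (101)")
conn.execute("INSERT INTO enrollment VALUES (101, 'CS101')")

# Deleting the referenced student cascades to the enrollment row.
conn.execute("DELETE FROM student WHERE sid = 101")
remaining = conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]
print(remaining)  # 0
```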
General Constraints:
General constraints enforce business rules beyond key and referential integrity.
Attribute-based CHECK constraints:
CREATE TABLE Employee (
emp_id INTEGER PRIMARY KEY,
salary DECIMAL(10,2) CHECK (salary >= 0),
birth_date DATE CHECK (birth_date < CURRENT_DATE)
);
Tuple-based CHECK constraints:
CREATE TABLE ProjectAssignment (
emp_id INTEGER REFERENCES Employee,
project_id INTEGER REFERENCES Project,
hours_allocated DECIMAL(5,2),
start_date DATE,
end_date DATE,
CHECK (start_date <= end_date),
CHECK (hours_allocated > 0)
);
Assertions (table-independent constraints):
CREATE ASSERTION salary_ratio CHECK (
NOT EXISTS (
SELECT * FROM Employee e1, Employee e2
WHERE e1.department = e2.department
AND e1.salary > e2.salary * 10
)
);
Relational algebra provides a formal foundation for querying relational databases. It consists of operations that take relations as input and produce new relations as output—the closure property enables composition of operations.
Fundamental Operations:
Selection (σ): Selects tuples satisfying a condition. σ_condition(R) returns all tuples in R that satisfy the condition.
Example: σ_major='CS'(Student) returns all CS students.
Projection (π): Selects specified attributes, removing duplicates. π_attributes(R) returns a relation with only the named attributes.
Example: π_sname,major(Student) returns student names and majors.
Union (∪): Returns tuples in either of two union-compatible relations. R ∪ S requires R and S have same attributes and domains.
Set Difference (−): Returns tuples in first relation but not second. R − S requires union compatibility.
Cartesian Product (×): Combines every tuple of first relation with every tuple of second. R × S produces tuples combining attributes of both relations.
Rename (ρ): Changes relation or attribute names. ρ_newname(R) renames the relation; ρ_newname(A₁, …, Aₙ)(R) also renames its attributes.
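To illustrate the closure property, the fundamental operations can be sketched over relations modeled as collections of dictionaries; because each operation returns a relation, they compose freely. The data and names below are illustrative.

```python
# Toy relation: a list of tuples represented as dicts.
student = [
    {"sid": 101, "sname": "Alice", "major": "CS"},
    {"sid": 102, "sname": "Bob",   "major": "Math"},
]

def select(rel, pred):
    # σ_pred(rel): keep tuples satisfying the predicate
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    # π_attrs(rel): keep named attributes, removing duplicates
    seen = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in seen:
            seen.append(row)
    return seen

# Composition, thanks to closure: π_sname(σ_major='CS'(Student))
result = project(select(student, lambda t: t["major"] == "CS"), ["sname"])
print(result)  # [{'sname': 'Alice'}]
```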
Additional Operations:
Intersection (∩): Tuples in both relations (derived from difference). R ∩ S = R − (R − S)
Join Operations:
Theta Join (⋈_θ): Cartesian product followed by selection. R ⋈_θ S = σ_θ(R × S)
Equijoin: Theta join with equality condition.
Natural Join (⋈): Equijoin over all common attributes, with duplicate columns removed.
Outer Joins: Preserve tuples even when no match exists.
- Left outer join (⟕): Preserves all left relation tuples
- Right outer join (⟖): Preserves all right relation tuples
- Full outer join (⟗): Preserves all tuples from both
Division (÷): Useful for "all" queries (e.g., students who take all courses). R ÷ S finds tuples in R that combine with all tuples in S.
Example Queries:
Given the schema:
Student(sid, sname, major)
Course(cid, ctitle, credits)
Enrollment(sid, cid, grade)
Query 1: Find names of CS students
π_sname(σ_major='CS'(Student))
Query 2: Find IDs of students enrolled in the 'Database' course
π_Enrollment.sid(σ_ctitle='Database'(Course ⋈ Enrollment))
Query 3: Find names of students who got an 'A' in any course
π_sname(Student ⋈ σ_grade='A'(Enrollment))
Query 4: Find students who take all courses
π_sid,sname(Student ⋈ (π_sid,cid(Enrollment) ÷ π_cid(Course)))
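Query 4 hinges on division, which can be sketched directly over sets of tuples. The data here is a toy example: student 101 is enrolled in every course, student 102 is not.

```python
# R(sid, cid) ÷ S(cid): the sids paired with every cid in S.
enrollment = {(101, 'CS101'), (101, 'CS102'), (102, 'CS101')}
courses = {'CS101', 'CS102'}

def divide(r, s):
    candidates = {sid for sid, _ in r}
    # sid qualifies iff its set of cids is a superset of s
    return {sid for sid in candidates
            if s <= {cid for sid2, cid in r if sid2 == sid}}

print(divide(enrollment, courses))  # {101}
```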
Relational calculus provides a declarative alternative to relational algebra—specifying what to retrieve rather than how to retrieve it.
Tuple Relational Calculus (TRC):
TRC expressions have form: { t | P(t) } where t is a tuple variable and P is a formula.
Example TRC queries:
Query 1: Find all students { s | Student(s) }
Query 2: Find names of CS students { t.sname | Student(t) AND t.major = 'CS' }
Query 3: Find students enrolled in 'CS101' { s | Student(s) AND ∃ e (Enrollment(e) AND e.sid = s.sid AND e.cid = 'CS101') }
Query 4: Find students who take all courses { s | Student(s) AND ∀ c (Course(c) ⇒ ∃ e (Enrollment(e) AND e.sid = s.sid AND e.cid = c.cid)) }
Domain Relational Calculus (DRC):
DRC uses domain variables instead of tuple variables. Expressions have form: { <x₁, ..., xₙ> | P(x₁, ..., xₙ) }
Example DRC queries:
Query 1: Find all student IDs and names { <s, n> | ∃ m, c (Student(s, n, m, c)) }
Query 2: Find names of CS students { <n> | ∃ s, m, c (Student(s, n, m, c) AND m = 'CS') }
Query 3: Find students enrolled in 'CS101' { <s, n> | ∃ m, c (Student(s, n, m, c)) AND ∃ g (Enrollment(s, 'CS101', g)) }
Safety of Expressions:
Both TRC and DRC must ensure expressions produce finite results. Unsafe expressions (e.g., { t | NOT Student(t) }) would generate infinite results and are disallowed.
Views are virtual relations defined by queries, providing customized perspectives on the database.
Types of Views:
Virtual Views: Stored definition only; materialized when queried.
Materialized Views: Physically stored and maintained; faster access but requires maintenance.
View Definition:
CREATE VIEW CSstudents AS
SELECT sid, sname, credits
FROM Student
WHERE major = 'Computer Science';
CREATE VIEW StudentGrades AS
SELECT s.sid, s.sname, c.cid, c.ctitle, e.grade
FROM Student s, Enrollment e, Course c
WHERE s.sid = e.sid AND e.cid = c.cid;
Advantages of Views:
- Security: Hide sensitive columns (e.g., salary)
- Simplicity: Present simplified, relevant data to users
- Consistency: Provide uniform interface despite schema changes
- Integrity: Enforce additional constraints through WITH CHECK OPTION
Updatable Views:
Not all views support updates. Conditions for updatability:
- View based on single base table
- View includes all NOT NULL columns without defaults
- No aggregation, DISTINCT, GROUP BY, or set operations
- No subqueries in SELECT list
WITH CHECK OPTION:
Ensures modifications through view maintain view membership:
CREATE VIEW SeniorStudents AS
SELECT * FROM Student WHERE classification = 'Senior'
WITH CHECK OPTION;
-- Prevents updating classification to something other than 'Senior'
Null represents missing or unknown information—a concept that complicates relational querying.
Interpretations of Null:
- Value unknown: Employee's birth date not recorded
- Value not applicable: Car's number of cylinders for a bicycle
- Value withheld: Employee's salary confidential
Three-Valued Logic:
SQL uses three truth values: TRUE, FALSE, UNKNOWN.
Truth Tables:
AND:
- TRUE AND UNKNOWN = UNKNOWN
- FALSE AND UNKNOWN = FALSE
- UNKNOWN AND UNKNOWN = UNKNOWN
OR:
- TRUE OR UNKNOWN = TRUE
- FALSE OR UNKNOWN = UNKNOWN
- UNKNOWN OR UNKNOWN = UNKNOWN
NOT:
- NOT UNKNOWN = UNKNOWN
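These truth tables can be checked directly, because SQL surfaces UNKNOWN as NULL in query results. A minimal sketch using Python's sqlite3 (expressions only, no tables needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
q = lambda expr: conn.execute(f"SELECT {expr}").fetchone()[0]

# UNKNOWN comes back as NULL, i.e. Python None.
print(q("1 AND NULL"))   # None -> TRUE AND UNKNOWN = UNKNOWN
print(q("0 AND NULL"))   # 0    -> FALSE AND UNKNOWN = FALSE
print(q("1 OR NULL"))    # 1    -> TRUE OR UNKNOWN = TRUE
print(q("0 OR NULL"))    # None -> FALSE OR UNKNOWN = UNKNOWN
print(q("NOT NULL"))     # None -> NOT UNKNOWN = UNKNOWN
print(q("NULL = NULL"))  # None -> any comparison with NULL is UNKNOWN
print(q("NULL IS NULL")) # 1    -> IS NULL is a true/false test, never UNKNOWN
```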
Comparison with Null:
Any comparison involving null yields UNKNOWN:
- 5 = NULL → UNKNOWN
- NULL = NULL → UNKNOWN
- NULL IS NULL → TRUE (special test)
Query Behavior:
WHERE clause returns tuples where condition is TRUE (UNKNOWN tuples excluded).
Example:
SELECT * FROM Employee WHERE salary > 50000;
-- Employees with null salary are NOT returned
Handling Nulls:
Special operators and functions:
- IS NULL / IS NOT NULL
- COALESCE(value1, value2, ...) returns first non-null
- NULLIF(expr1, expr2) returns null if equal
- NVL/IFNULL (DBMS-specific)
Impact on Aggregates:
- COUNT(*) includes all rows
- COUNT(column) excludes nulls
- SUM, AVG ignore nulls
- GROUP BY treats nulls as a single group
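The aggregate rules above are worth verifying by hand; a short sketch using Python's sqlite3 with an invented Employee table:

```python
import sqlite3

# Invented toy table: one employee's salary is unknown (NULL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (name TEXT, salary INTEGER);
INSERT INTO Employee VALUES ('Ann', 60000), ('Ben', NULL), ('Cy', 40000);
""")

# WHERE keeps only rows where the condition is TRUE: Ben's UNKNOWN row drops out.
print(conn.execute(
    "SELECT COUNT(*) FROM Employee WHERE salary > 50000").fetchone()[0])  # 1

# COUNT(*) counts rows; COUNT(column) skips NULLs; SUM/AVG ignore NULLs.
row = conn.execute(
    "SELECT COUNT(*), COUNT(salary), SUM(salary), AVG(salary) FROM Employee"
).fetchone()
print(row)  # (3, 2, 100000, 50000.0)
```

Note that AVG divides by the count of non-null values (2), not the row count (3), which is exactly the behavior the bullet list describes.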
SQL (Structured Query Language) originated from IBM's System R project in the 1970s, based on Codd's relational model. Originally called SEQUEL (Structured English Query Language), the name was shortened to SQL due to trademark issues.
Evolution of SQL Standards:
SQL-86 (SQL1): First ANSI standard, establishing basic DDL and DML.
SQL-89 (SQL1.5): Minor revisions, added integrity constraints.
SQL-92 (SQL2): Major revision, substantial expansion:
- New data types (DATE, TIME, TIMESTAMP)
- Enhanced joins (NATURAL JOIN, OUTER JOIN)
- Set operations (UNION, INTERSECT, EXCEPT)
- Schema manipulation statements
- Dynamic SQL
SQL:1999 (SQL3): Major extensions:
- Object-relational features (UDT, references)
- Recursive queries (WITH RECURSIVE)
- Triggers
- OLAP functions
- Java integration
SQL:2003: Introduced:
- XML-related features
- Window functions
- Sequences
- Identity columns
- MERGE statement
SQL:2006: Enhanced XML support.
SQL:2008: Added:
- TRUNCATE statement
- FETCH FIRST clause
- Compound statements
SQL:2011: Added:
- Temporal databases (periods, time-travel queries)
- Enhanced window functions
SQL:2016: Added:
- JSON support
- Polymorphic table functions
- Row pattern matching
DDL defines and modifies database structures.
CREATE Statements:
Create Database:
CREATE DATABASE university
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Create Table:
CREATE TABLE students (
student_id INTEGER AUTO_INCREMENT,
student_name VARCHAR(100) NOT NULL,
date_of_birth DATE,
email VARCHAR(200) UNIQUE,
enrollment_date DATE DEFAULT CURRENT_DATE,
major_code CHAR(4),
credits_completed INTEGER DEFAULT 0 CHECK (credits_completed >= 0),
status ENUM('active', 'graduated', 'suspended') DEFAULT 'active',
PRIMARY KEY (student_id),
FOREIGN KEY (major_code) REFERENCES departments(dept_code)
) ENGINE=InnoDB;
Create Schema:
CREATE SCHEMA registration AUTHORIZATION dbadmin;
ALTER Statements:
Add Column:
ALTER TABLE students
ADD COLUMN phone_number VARCHAR(20) AFTER email;
Modify Column:
ALTER TABLE students
MODIFY COLUMN student_name VARCHAR(150) NOT NULL;
Drop Column:
ALTER TABLE students
DROP COLUMN phone_number;
Add Constraint:
ALTER TABLE students
ADD CONSTRAINT chk_credits CHECK (credits_completed <= 150);
DROP Statements:
DROP TABLE students; -- Removes table and data
DROP TABLE students CASCADE CONSTRAINTS; -- Drops dependent objects
DROP DATABASE university;
DROP SCHEMA registration RESTRICT; -- Fails if schema contains objects
TRUNCATE:
TRUNCATE TABLE temp_enrollments; -- Fast removal of all rows
DML manipulates data within database structures.
INSERT:
Insert single row:
INSERT INTO students (student_id, student_name, email, major_code)
VALUES (1001, 'Alice Johnson', 'alice@university.edu', 'CS');
Insert with defaults:
INSERT INTO students (student_name, email, major_code)
VALUES ('Bob Smith', 'bob@university.edu', 'MATH');
-- student_id auto-generates, enrollment_date defaults to today
Insert multiple rows:
INSERT INTO students (student_name, email, major_code) VALUES
('Carol White', 'carol@university.edu', 'PHYS'),
('David Brown', 'david@university.edu', 'CS'),
('Eva Green', 'eva@university.edu', 'MATH');
Insert from query:
INSERT INTO honors_students (student_id, student_name, gpa)
SELECT s.student_id, s.student_name,
AVG(e.grade_points) as gpa
FROM students s
JOIN enrollments e ON s.student_id = e.student_id
GROUP BY s.student_id, s.student_name
HAVING AVG(e.grade_points) >= 3.5;
UPDATE:
Update with subquery condition:
UPDATE students
SET status = 'graduated'
WHERE student_id IN (SELECT student_id FROM graduates_2023);
Update with expression:
UPDATE products
SET price = price * 1.10
WHERE category = 'electronics';
Update with joins:
UPDATE students s
JOIN enrollments e ON s.student_id = e.student_id
SET s.credits_completed = s.credits_completed + e.credits_earned
WHERE e.semester = '202401' AND e.grade != 'F';
DELETE:
Delete specific rows:
DELETE FROM students
WHERE status = 'graduated' AND graduation_date < '2020-01-01';
Delete all rows:
DELETE FROM temp_log; -- Slow for large tables
TRUNCATE TABLE temp_log; -- Fast, but typically cannot be rolled back
MERGE (Upsert):
MERGE INTO products_target t
USING products_source s ON t.product_id = s.product_id
WHEN MATCHED THEN
UPDATE SET t.price = s.price, t.stock = s.stock
WHEN NOT MATCHED THEN
INSERT (product_id, product_name, price, stock)
VALUES (s.product_id, s.product_name, s.price, s.stock);
DCL manages permissions and access control.
GRANT:
-- Grant specific privileges
GRANT SELECT, INSERT ON students TO registrar_user;
-- Grant all privileges
GRANT ALL PRIVILEGES ON university.* TO dbadmin@localhost;
-- Grant with grant option (can pass on privileges)
GRANT SELECT ON courses TO instructor WITH GRANT OPTION;
-- Grant at column level
GRANT UPDATE (grade) ON enrollments TO instructors;
REVOKE:
-- Revoke specific privileges
REVOKE INSERT ON students FROM registrar_user;
-- Revoke all privileges
REVOKE ALL PRIVILEGES ON university.* FROM dbadmin@localhost;
-- Revoke with cascade/restrict
REVOKE SELECT ON courses FROM instructor CASCADE;
The SELECT statement forms the foundation of data retrieval.
Basic Syntax:
SELECT [DISTINCT] select_list
FROM table_list
[WHERE search_condition]
[GROUP BY group_by_list]
[HAVING search_condition]
[ORDER BY order_by_list]
[LIMIT row_count];
Simple SELECT:
SELECT * FROM students; -- All columns
SELECT student_name, email FROM students; -- Specific columns
SELECT DISTINCT major_code FROM students; -- Unique values
SELECT student_name AS name, email AS contact -- Column aliases
FROM students;
SELECT with Expressions:
SELECT student_name,
TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE()) AS age,
CONCAT(last_name, ', ', first_name) AS full_name
FROM students;
SELECT with Literals:
SELECT student_name,
'Active' AS status,
CURRENT_DATE AS query_date
FROM students
WHERE status = 'active';
WHERE clause filters rows based on conditions.
Comparison Operators:
-- Numeric comparisons
SELECT * FROM products WHERE price > 100;
SELECT * FROM products WHERE price BETWEEN 50 AND 100;
SELECT * FROM products WHERE price IN (19.99, 29.99, 39.99);
-- String comparisons
SELECT * FROM students WHERE student_name LIKE 'A%'; -- Starts with A
SELECT * FROM students WHERE student_name LIKE '%son'; -- Ends with 'son'
SELECT * FROM students WHERE email LIKE '%@university.edu';
-- Date comparisons
SELECT * FROM orders WHERE order_date >= '2024-01-01';
SELECT * FROM students WHERE YEAR(enrollment_date) = 2024;
Logical Operators:
-- AND: both conditions true
SELECT * FROM students
WHERE major_code = 'CS' AND credits_completed > 60;
-- OR: either condition true
SELECT * FROM students
WHERE major_code = 'CS' OR major_code = 'MATH';
-- NOT: negate condition
SELECT * FROM students
WHERE NOT (status = 'graduated');
NULL Handling:
-- Check for NULL
SELECT * FROM students WHERE graduation_date IS NULL;
-- Check for NOT NULL
SELECT * FROM students WHERE phone_number IS NOT NULL;
-- COALESCE for default values
SELECT student_name,
COALESCE(phone_number, 'No phone') AS contact
FROM students;
ORDER BY (Sorting):
-- Ascending order (default)
SELECT student_name, credits_completed
FROM students
ORDER BY credits_completed;
-- Descending order
SELECT student_name, credits_completed
FROM students
ORDER BY credits_completed DESC;
-- Multiple sort columns
SELECT student_name, major_code, credits_completed
FROM students
ORDER BY major_code ASC, credits_completed DESC;
-- Sorting with expressions
SELECT student_name, TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE()) AS age
FROM students
ORDER BY age DESC;
GROUP BY (Aggregation):
-- Basic grouping
SELECT major_code, COUNT(*) AS student_count
FROM students
GROUP BY major_code;
-- Multiple grouping columns
SELECT major_code, status, COUNT(*) AS count
FROM students
GROUP BY major_code, status;
-- Group by expression
SELECT YEAR(enrollment_date) AS enrollment_year,
COUNT(*) AS new_students
FROM students
GROUP BY YEAR(enrollment_date);
HAVING filters groups, just as WHERE filters rows.
-- Filter groups
SELECT major_code, AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
HAVING AVG(credits_completed) > 30;
-- Multiple conditions
SELECT major_code,
COUNT(*) AS student_count,
AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
HAVING COUNT(*) > 10 AND AVG(credits_completed) BETWEEN 30 AND 60;
-- HAVING vs WHERE
SELECT major_code, COUNT(*) AS student_count
FROM students
WHERE status = 'active' -- Filter rows before grouping
GROUP BY major_code
HAVING COUNT(*) > 5; -- Filter groups after aggregation
Aggregate Functions:
-- COUNT: number of rows
SELECT COUNT(*) FROM students; -- Total rows
SELECT COUNT(email) FROM students; -- Non-null emails
SELECT COUNT(DISTINCT major_code) FROM students; -- Unique majors
-- SUM: total of numeric column
SELECT SUM(credits_completed) FROM students;
-- AVG: average of numeric column
SELECT AVG(credits_completed) FROM students;
-- MIN/MAX: minimum/maximum values
SELECT MIN(credits_completed), MAX(credits_completed) FROM students;
-- Multiple aggregates
SELECT
COUNT(*) AS total_students,
AVG(credits_completed) AS avg_credits,
MIN(credits_completed) AS min_credits,
MAX(credits_completed) AS max_credits
FROM students;
String Functions:
-- Concatenation
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM students;
-- Substring
SELECT SUBSTRING(email, 1, LOCATE('@', email) - 1) AS username FROM students;
-- Case conversion
SELECT UPPER(last_name), LOWER(first_name) FROM students;
-- Length
SELECT student_name, LENGTH(student_name) AS name_length FROM students;
-- Trimming
SELECT TRIM(' ' FROM student_name) FROM students;
-- Replacement
SELECT REPLACE(email, '@university.edu', '@alumni.edu') FROM students;
Date/Time Functions:
-- Current date/time
SELECT CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP;
-- Date parts
SELECT
student_name,
YEAR(date_of_birth) AS birth_year,
MONTH(date_of_birth) AS birth_month,
DAY(date_of_birth) AS birth_day
FROM students;
-- Date arithmetic
SELECT
student_name,
DATEDIFF(CURRENT_DATE, enrollment_date) AS days_enrolled,
DATE_ADD(enrollment_date, INTERVAL 4 YEAR) AS expected_graduation
FROM students;
-- Date formatting
SELECT
student_name,
DATE_FORMAT(enrollment_date, '%M %d, %Y') AS formatted_date
FROM students;
Numeric Functions:
-- Rounding
SELECT price, ROUND(price, 2) FROM products;
SELECT price, CEILING(price), FLOOR(price) FROM products;
-- Absolute value
SELECT ABS(balance) FROM accounts;
-- Random
SELECT RAND() FROM DUAL; -- Random number between 0 and 1
-- Mathematical functions
SELECT
price,
POWER(price, 2) AS price_squared,
SQRT(price) AS price_sqrt
FROM products;
Conditional Functions:
-- CASE statement
SELECT
student_name,
credits_completed,
CASE
WHEN credits_completed >= 120 THEN 'Senior'
WHEN credits_completed >= 90 THEN 'Junior'
WHEN credits_completed >= 60 THEN 'Sophomore'
WHEN credits_completed >= 30 THEN 'Freshman'
ELSE 'First Term'
END AS class_level
FROM students;
-- IF function (MySQL)
SELECT
student_name,
IF(status = 'active', 'Currently enrolled', 'Not active') AS enrollment_status
FROM students;
-- COALESCE (first non-null)
SELECT
student_name,
COALESCE(phone_number, email, 'No contact') AS contact
FROM students;
-- NULLIF (null if equal)
SELECT
student_name,
NULLIF(status, 'active') AS non_active_status
FROM students;
Column-Level Constraints:
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY, -- Primary key
ssn CHAR(11) UNIQUE, -- Unique constraint
first_name VARCHAR(50) NOT NULL, -- Not null
last_name VARCHAR(50) NOT NULL,
hire_date DATE DEFAULT CURRENT_DATE, -- Default value
salary DECIMAL(10,2) CHECK (salary > 0), -- Check constraint
dept_id INTEGER REFERENCES departments(dept_id) -- Foreign key
);
Table-Level Constraints:
CREATE TABLE enrollments (
student_id INTEGER,
course_id CHAR(8),
semester CHAR(6),
grade CHAR(2),
enrollment_date DATE DEFAULT CURRENT_DATE,
-- Table-level primary key (composite)
CONSTRAINT pk_enrollment PRIMARY KEY (student_id, course_id, semester),
-- Table-level foreign keys
CONSTRAINT fk_enroll_student FOREIGN KEY (student_id)
REFERENCES students(student_id) ON DELETE CASCADE,
CONSTRAINT fk_enroll_course FOREIGN KEY (course_id)
REFERENCES courses(course_id) ON DELETE RESTRICT,
-- Table-level check constraint
CONSTRAINT chk_grade CHECK (grade IN ('A', 'B', 'C', 'D', 'F', 'I'))
);
Adding Constraints to Existing Tables:
-- Primary key
ALTER TABLE employees ADD CONSTRAINT pk_emp PRIMARY KEY (emp_id);
-- Foreign key
ALTER TABLE employees ADD CONSTRAINT fk_emp_dept
FOREIGN KEY (dept_id) REFERENCES departments(dept_id);
-- Unique constraint
ALTER TABLE employees ADD CONSTRAINT uk_emp_ssn UNIQUE (ssn);
-- Check constraint
ALTER TABLE employees ADD CONSTRAINT chk_salary CHECK (salary >= 0);
-- Not null (special case)
ALTER TABLE employees MODIFY first_name VARCHAR(50) NOT NULL;
Dropping Constraints:
ALTER TABLE employees DROP PRIMARY KEY;
ALTER TABLE employees DROP FOREIGN KEY fk_emp_dept;
ALTER TABLE employees DROP CONSTRAINT chk_salary;
ALTER TABLE employees MODIFY first_name VARCHAR(50) NULL; -- Drop not null
Subqueries are queries nested within other queries, providing powerful data retrieval capabilities.
Scalar Subqueries (Return Single Value):
-- Subquery in SELECT clause
SELECT
student_name,
(SELECT AVG(grade_points)
FROM enrollments e
WHERE e.student_id = s.student_id) AS avg_grade
FROM students s;
-- Subquery in WHERE clause
SELECT student_name, major_code
FROM students
WHERE credits_completed > (SELECT AVG(credits_completed) FROM students);
-- Subquery in HAVING clause
SELECT major_code, AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
HAVING AVG(credits_completed) > (SELECT AVG(credits_completed) FROM students);
Row Subqueries (Return Single Row):
-- Compare with multiple columns
SELECT student_name, major_code, credits_completed
FROM students
WHERE (major_code, credits_completed) =
(SELECT major_code, MAX(credits_completed)
FROM students
WHERE major_code = 'CS');
Table Subqueries (Return Multiple Rows):
-- Subquery in FROM clause (derived table)
SELECT dept_name, avg_credits
FROM departments d
JOIN (
SELECT major_code, AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
) dept_stats ON d.dept_code = dept_stats.major_code;
-- Subquery with IN
SELECT student_name
FROM students
WHERE student_id IN (
SELECT DISTINCT student_id
FROM enrollments
WHERE grade = 'A'
);
-- Subquery with EXISTS
SELECT d.dept_name
FROM departments d
WHERE EXISTS (
SELECT 1
FROM courses c
WHERE c.dept_code = d.dept_code
AND c.credits > 3
);
Correlated Subqueries:
Correlated subqueries reference columns from the outer query, executing once for each outer row.
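This per-row evaluation runs the same way on any engine with subquery support. A self-contained sketch using Python's sqlite3 (the students data below is invented):

```python
import sqlite3

# Invented toy data: two majors, two students each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_name TEXT, major_code TEXT, credits_completed INTEGER);
INSERT INTO students VALUES
  ('Alice', 'CS', 90), ('Bob', 'CS', 30),
  ('Carol', 'MATH', 80), ('Dave', 'MATH', 40);
""")

# The inner query is re-evaluated per outer row, using that row's s1.major_code.
rows = conn.execute("""
    SELECT s1.student_name
    FROM students s1
    WHERE s1.credits_completed > (
        SELECT AVG(s2.credits_completed)
        FROM students s2
        WHERE s2.major_code = s1.major_code)
    ORDER BY s1.student_name
""").fetchall()
print(rows)  # [('Alice',), ('Carol',)]
```

Each major's average is 60 here, so only the above-average student in each major survives the filter.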
-- Find students with above-average credits in their major
SELECT s1.student_name, s1.major_code, s1.credits_completed
FROM students s1
WHERE s1.credits_completed > (
SELECT AVG(s2.credits_completed)
FROM students s2
WHERE s2.major_code = s1.major_code
);
-- Find courses with above-average enrollment
SELECT c.course_id, c.title,
(SELECT COUNT(*)
FROM enrollments e
WHERE e.course_id = c.course_id) AS enrollment_count
FROM courses c
WHERE (
SELECT COUNT(*)
FROM enrollments e
WHERE e.course_id = c.course_id
) > (
SELECT AVG(enrollment_count)
FROM (
SELECT COUNT(*) AS enrollment_count
FROM enrollments
GROUP BY course_id
) avg_enrollments
);
Subquery Operators:
-- ALL: compare with all values
SELECT student_name, credits_completed
FROM students
WHERE credits_completed > ALL (
SELECT credits_completed
FROM students
WHERE major_code = 'MATH'
);
-- ANY/SOME: compare with any value
SELECT student_name, credits_completed
FROM students
WHERE credits_completed > ANY (
SELECT credits_completed
FROM students
WHERE major_code = 'MATH'
);
-- EXISTS: check for existence
SELECT d.dept_name
FROM departments d
WHERE EXISTS (
SELECT 1
FROM faculty f
WHERE f.dept_code = d.dept_code
AND f.rank = 'Professor'
AND f.salary > 150000
);
Joins combine rows from multiple tables based on related columns.
INNER JOIN:
Returns only rows with matches in both tables.
-- Basic inner join
SELECT s.student_name, c.title, e.grade
FROM students s
INNER JOIN enrollments e ON s.student_id = e.student_id
INNER JOIN courses c ON e.course_id = c.course_id;
-- Using USING when column names match
SELECT student_name, title, grade
FROM students
JOIN enrollments USING (student_id)
JOIN courses USING (course_id);
-- Multiple join conditions
SELECT e1.student_id, e1.course_id, e1.grade, e2.grade AS prev_grade
FROM enrollments e1
JOIN enrollments e2 ON e1.student_id = e2.student_id
AND e1.course_id = e2.course_id
AND e1.semester > e2.semester;
OUTER JOIN:
Preserves rows even when no match exists.
-- LEFT OUTER JOIN: all rows from left table
SELECT s.student_name, e.course_id, e.grade
FROM students s
LEFT JOIN enrollments e ON s.student_id = e.student_id;
-- RIGHT OUTER JOIN: all rows from right table
SELECT s.student_name, e.course_id, e.grade
FROM enrollments e
RIGHT JOIN students s ON e.student_id = s.student_id;
-- FULL OUTER JOIN: all rows from both tables
SELECT s.student_name, e.course_id, e.grade
FROM students s
FULL OUTER JOIN enrollments e ON s.student_id = e.student_id;
-- Note: MySQL lacks FULL OUTER JOIN; emulate with LEFT JOIN UNION RIGHT JOIN
SELF JOIN:
Joining a table with itself.
-- Find employees and their managers
SELECT e1.emp_name AS employee,
e2.emp_name AS manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.emp_id;
-- Find pairs of students in same major
SELECT s1.student_name AS student1,
s2.student_name AS student2,
s1.major_code
FROM students s1
JOIN students s2 ON s1.major_code = s2.major_code
AND s1.student_id < s2.student_id;
CROSS JOIN:
Cartesian product of all rows.
-- All possible combinations
SELECT s.student_name, c.title
FROM students s
CROSS JOIN courses c;
-- Generate date range
SELECT dates.date, COUNT(e.enrollment_id)
FROM (
SELECT DATE_ADD('2024-01-01', INTERVAL n DAY) AS date
FROM numbers -- Table with numbers 0 to 365
WHERE DATE_ADD('2024-01-01', INTERVAL n DAY) <= '2024-12-31'
) dates
LEFT JOIN enrollments e ON DATE(e.enrollment_date) = dates.date
GROUP BY dates.date;
NATURAL JOIN:
Automatically joins on columns with same names (use cautiously).
-- Automatically joins on common column names
SELECT student_name, title, grade
FROM students
NATURAL JOIN enrollments
NATURAL JOIN courses;
Set operations combine results from multiple queries.
UNION and UNION ALL:
-- UNION removes duplicates
SELECT student_id, student_name FROM students
WHERE major_code = 'CS'
UNION
SELECT student_id, student_name FROM students
WHERE credits_completed > 100;
-- UNION ALL preserves duplicates (faster)
SELECT student_id FROM enrollments WHERE semester = '202401'
UNION ALL
SELECT student_id FROM enrollments WHERE semester = '202402';
-- Combining different tables
SELECT 'Student' AS type, student_name AS name FROM students
UNION
SELECT 'Faculty' AS type, faculty_name AS name FROM faculty
UNION
SELECT 'Staff' AS type, staff_name AS name FROM staff;
INTERSECT:
Returns rows common to both queries.
-- Students who took both CS101 and MATH201
SELECT student_id FROM enrollments WHERE course_id = 'CS101'
INTERSECT
SELECT student_id FROM enrollments WHERE course_id = 'MATH201';
-- Equivalent using EXISTS
SELECT DISTINCT student_id
FROM enrollments e1
WHERE course_id = 'CS101'
AND EXISTS (
SELECT 1 FROM enrollments e2
WHERE e2.student_id = e1.student_id
AND e2.course_id = 'MATH201'
);
EXCEPT (or MINUS):
Returns rows from first query not in second.
-- Students who took CS101 but not MATH201
SELECT student_id FROM enrollments WHERE course_id = 'CS101'
EXCEPT
SELECT student_id FROM enrollments WHERE course_id = 'MATH201';
-- Courses with no enrollments
SELECT course_id FROM courses
EXCEPT
SELECT DISTINCT course_id FROM enrollments;
Window functions perform calculations across sets of rows related to the current row.
ROW_NUMBER, RANK, DENSE_RANK:
-- Assign row numbers
SELECT
student_name,
credits_completed,
ROW_NUMBER() OVER (ORDER BY credits_completed DESC) AS row_num,
RANK() OVER (ORDER BY credits_completed DESC) AS `rank`,
DENSE_RANK() OVER (ORDER BY credits_completed DESC) AS `dense_rank`
FROM students;
-- Partition by major
SELECT
student_name,
major_code,
credits_completed,
ROW_NUMBER() OVER (PARTITION BY major_code
ORDER BY credits_completed DESC) AS rank_in_major
FROM students;
LAG and LEAD:
Access previous/next row values.
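SQLite (3.25+) implements LAG and LEAD as well, so the behavior is easy to check locally; a minimal sketch with an invented student_gpa table:

```python
import sqlite3  # window functions require SQLite >= 3.25

# Invented toy data: one student's GPA across three semesters.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_gpa (student_id INTEGER, semester TEXT, grade_points REAL);
INSERT INTO student_gpa VALUES
  (1, '202301', 3.0), (1, '202302', 3.5), (1, '202401', 3.2);
""")

# LAG pulls the previous row's value within the ordered window;
# the first row has no predecessor, so it gets NULL.
rows = conn.execute("""
    SELECT semester, grade_points,
           LAG(grade_points, 1) OVER (ORDER BY semester) AS prev_gpa
    FROM student_gpa WHERE student_id = 1
""").fetchall()
print(rows)  # [('202301', 3.0, None), ('202302', 3.5, 3.0), ('202401', 3.2, 3.5)]
```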
-- Compare with previous enrollment
SELECT
student_id,
semester,
grade_points,
LAG(grade_points, 1) OVER (PARTITION BY student_id
ORDER BY semester) AS prev_gpa,
grade_points - LAG(grade_points, 1, grade_points)
OVER (PARTITION BY student_id ORDER BY semester) AS gpa_change
FROM student_gpa;
-- LEAD for next value
SELECT
employee_id,
salary,
effective_date,
LEAD(effective_date, 1, '9999-12-31')
OVER (PARTITION BY employee_id ORDER BY effective_date) AS next_date
FROM salary_history;
Aggregate Window Functions:
-- Running totals
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) AS running_total,
AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM orders
WHERE customer_id = 123;
-- Percentage of total
SELECT
department,
employee_name,
salary,
salary / SUM(salary) OVER (PARTITION BY department) * 100 AS pct_of_dept
FROM employees;
FIRST_VALUE and LAST_VALUE:
-- First and last in group
SELECT
department,
employee_name,
salary,
FIRST_VALUE(employee_name) OVER (PARTITION BY department
ORDER BY salary DESC) AS highest_paid,
LAST_VALUE(employee_name) OVER (PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS lowest_paid
FROM employees;
NTILE:
Divide rows into buckets.
-- Divide students into quartiles by credits
SELECT
student_name,
credits_completed,
NTILE(4) OVER (ORDER BY credits_completed) AS quartile
FROM students;
-- Deciles by salary
SELECT
department,
employee_name,
salary,
NTILE(10) OVER (PARTITION BY department ORDER BY salary) AS salary_decile
FROM employees;
CTEs provide temporary result sets within a query.
Basic CTE:
WITH dept_stats AS (
SELECT
major_code,
COUNT(*) AS student_count,
AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
)
SELECT d.dept_name, ds.student_count, ds.avg_credits
FROM departments d
JOIN dept_stats ds ON d.dept_code = ds.major_code;
Multiple CTEs:
WITH
high_achievers AS (
SELECT student_id, AVG(grade_points) AS gpa
FROM enrollments
GROUP BY student_id
HAVING AVG(grade_points) > 3.5
),
active_students AS (
SELECT student_id, student_name
FROM students
WHERE status = 'active'
)
SELECT a.student_name, h.gpa
FROM active_students a
JOIN high_achievers h ON a.student_id = h.student_id;
CTE for Hierarchical Data:
See Recursive Queries section (5.6).
Recursive CTEs enable querying hierarchical or graph-structured data.
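The anchor/recursive-member structure can be exercised in miniature with Python's sqlite3, which also supports WITH RECURSIVE (the three-person employees table below is invented):

```python
import sqlite3

# Invented toy hierarchy: Dana manages Eli, who manages Fay.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER, emp_name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Dana', NULL),
  (2, 'Eli',  1),
  (3, 'Fay',  2);
""")

rows = conn.execute("""
    WITH RECURSIVE emp_hierarchy AS (
        SELECT emp_id, emp_name, 1 AS level      -- anchor: top of the tree
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.emp_id, e.emp_name, h.level + 1 -- recursive step: direct reports
        FROM employees e
        JOIN emp_hierarchy h ON e.manager_id = h.emp_id)
    SELECT emp_name, level FROM emp_hierarchy ORDER BY level
""").fetchall()
print(rows)  # [('Dana', 1), ('Eli', 2), ('Fay', 3)]
```

The recursion stops naturally when a pass produces no new rows, i.e. when the deepest employees have no reports.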
Employee Hierarchy:
WITH RECURSIVE emp_hierarchy AS (
-- Anchor member: top-level employees
SELECT
emp_id,
emp_name,
manager_id,
1 AS level,
CAST(emp_name AS CHAR(1000)) AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive member
SELECT
e.emp_id,
e.emp_name,
e.manager_id,
h.level + 1,
CONCAT(h.path, ' -> ', e.emp_name)
FROM employees e
JOIN emp_hierarchy h ON e.manager_id = h.emp_id
)
SELECT * FROM emp_hierarchy
ORDER BY path;
Bill of Materials:
WITH RECURSIVE bom AS (
-- Base components
SELECT
part_id,
part_name,
quantity,
part_id AS root_part,
1 AS level
FROM parts
WHERE parent_id IS NULL
UNION ALL
-- Subcomponents
SELECT
p.part_id,
p.part_name,
p.quantity * b.quantity,
b.root_part,
b.level + 1
FROM parts p
JOIN bom b ON p.parent_id = b.part_id
)
SELECT
root_part,
part_id,
part_name,
quantity,
level
FROM bom
ORDER BY root_part, level;
Graph Traversal:
-- Find all paths in a graph
WITH RECURSIVE paths AS (
-- Direct connections
SELECT
node_from,
node_to,
CAST(node_from AS CHAR(1000)) AS path,
1 AS length
FROM edges
UNION ALL
-- Extended paths
SELECT
p.node_from,
e.node_to,
CONCAT(p.path, ',', e.node_to),
p.length + 1
FROM paths p
JOIN edges e ON p.node_to = e.node_from
WHERE FIND_IN_SET(e.node_to, p.path) = 0 -- skip nodes already on the path (cycle check)
)
SELECT * FROM paths;
Stored procedures encapsulate business logic within the database.
Basic Procedure:
DELIMITER //
CREATE PROCEDURE GetStudentInfo(IN student_id_param INT)
BEGIN
SELECT s.student_name, s.major_code, s.credits_completed,
COUNT(e.course_id) AS courses_taken,
AVG(e.grade_points) AS gpa
FROM students s
LEFT JOIN enrollments e ON s.student_id = e.student_id
WHERE s.student_id = student_id_param
GROUP BY s.student_id;
END//
DELIMITER ;
-- Call procedure
CALL GetStudentInfo(1001);
Procedure with OUT Parameters:
DELIMITER //
CREATE PROCEDURE CalculateDepartmentStats(
IN dept_code_param CHAR(4),
OUT student_count INT,
OUT avg_credits DECIMAL(5,2),
OUT max_credits INT
)
BEGIN
SELECT COUNT(*), AVG(credits_completed), MAX(credits_completed)
INTO student_count, avg_credits, max_credits
FROM students
WHERE major_code = dept_code_param;
END//
DELIMITER ;
-- Call with OUT parameters
CALL CalculateDepartmentStats('CS', @count, @avg, @max);
SELECT @count, @avg, @max;
Procedure with Conditional Logic:
DELIMITER //
CREATE PROCEDURE EnrollStudent(
IN p_student_id INT,
IN p_course_id CHAR(8),
IN p_semester CHAR(6)
)
BEGIN
DECLARE v_enrollment_count INT;
DECLARE v_course_capacity INT;
DECLARE v_current_enrollments INT;
-- Start transaction
START TRANSACTION;
-- Check if already enrolled
SELECT COUNT(*) INTO v_enrollment_count
FROM enrollments
WHERE student_id = p_student_id
AND course_id = p_course_id
AND semester = p_semester;
IF v_enrollment_count > 0 THEN
ROLLBACK;
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Student already enrolled in this course';
END IF;
-- Check course capacity
SELECT capacity INTO v_course_capacity
FROM course_sections
WHERE course_id = p_course_id AND semester = p_semester;
SELECT COUNT(*) INTO v_current_enrollments
FROM enrollments
WHERE course_id = p_course_id AND semester = p_semester;
IF v_current_enrollments >= v_course_capacity THEN
ROLLBACK;
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Course section is full';
END IF;
-- Enroll student
INSERT INTO enrollments (student_id, course_id, semester, enrollment_date)
VALUES (p_student_id, p_course_id, p_semester, CURRENT_DATE);
COMMIT;
END//
DELIMITER ;
Triggers automatically execute in response to database events.
BEFORE INSERT Trigger:
DELIMITER //
CREATE TRIGGER validate_enrollment
BEFORE INSERT ON enrollments
FOR EACH ROW
BEGIN
DECLARE v_student_status VARCHAR(20);
DECLARE v_prereq_complete BOOLEAN;
-- Check student status
SELECT status INTO v_student_status
FROM students
WHERE student_id = NEW.student_id;
IF v_student_status != 'active' THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Student is not active';
END IF;
-- Check prerequisites (simplified)
-- Additional validation logic here
END//
DELIMITER ;
AFTER UPDATE Trigger:
DELIMITER //
CREATE TRIGGER update_student_gpa
AFTER UPDATE ON enrollments
FOR EACH ROW
BEGIN
DECLARE v_new_gpa DECIMAL(3,2);
-- Recalculate student's GPA
SELECT AVG(grade_points) INTO v_new_gpa
FROM enrollments
WHERE student_id = NEW.student_id
AND grade IS NOT NULL;
-- Update student record
UPDATE students
SET gpa = v_new_gpa
WHERE student_id = NEW.student_id;
END//
DELIMITER ;
Triggers for Auditing:
DELIMITER //
CREATE TRIGGER audit_student_changes
AFTER UPDATE ON students
FOR EACH ROW
BEGIN
INSERT INTO audit_log (
table_name,
record_id,
action_type,
old_value,
new_value,
changed_by,
change_date
) VALUES (
'students',
OLD.student_id,
'UPDATE',
JSON_OBJECT(
'name', OLD.student_name,
'major', OLD.major_code,
'status', OLD.status
),
JSON_OBJECT(
'name', NEW.student_name,
'major', NEW.major_code,
'status', NEW.status
),
USER(),
NOW()
);
END//
DELIMITER ;
Indexes dramatically improve query performance.
Basic Index Creation:
-- Single column index
CREATE INDEX idx_student_name ON students(student_name);
-- Unique index
CREATE UNIQUE INDEX idx_student_email ON students(email);
-- Composite index
CREATE INDEX idx_enrollments_lookup ON enrollments(student_id, course_id, semester);
-- Prefix index (for strings)
CREATE INDEX idx_course_title ON courses(title(20));
Functional Indexes:
-- Index on expression (PostgreSQL)
CREATE INDEX idx_student_lower_email ON students(LOWER(email));
-- Index on function (Oracle)
CREATE INDEX idx_enrollment_year ON enrollments(EXTRACT(YEAR FROM enrollment_date));
Partial Indexes:
-- Index only active students (PostgreSQL)
CREATE INDEX idx_active_students ON students(student_id)
WHERE status = 'active';
-- Index high-value orders
CREATE INDEX idx_large_orders ON orders(order_id)
WHERE amount > 1000;
Invisible/Dropping Indexes:
-- Make index invisible (Oracle)
ALTER INDEX idx_student_name INVISIBLE;
-- Drop index
DROP INDEX idx_student_name ON students;
Understanding and optimizing query performance.
EXPLAIN Plans:
-- MySQL
EXPLAIN SELECT s.student_name, c.title, e.grade
FROM students s
JOIN enrollments e ON s.student_id = e.student_id
JOIN courses c ON e.course_id = c.course_id
WHERE s.major_code = 'CS';
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM students WHERE student_name LIKE 'Smith%';
-- Oracle
EXPLAIN PLAN FOR
SELECT * FROM students WHERE credits_completed > 100;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
Query Optimization Techniques:
-- EXISTS vs IN for large subqueries (optimizer-dependent)
-- Often slower on older optimizers
SELECT * FROM students
WHERE student_id IN (SELECT student_id FROM enrollments);
-- Often faster; many modern optimizers rewrite both forms identically
SELECT * FROM students s
WHERE EXISTS (SELECT 1 FROM enrollments e WHERE e.student_id = s.student_id);
-- Avoid functions on indexed columns
-- Bad (function on indexed column)
SELECT * FROM students WHERE YEAR(enrollment_date) = 2024;
-- Good
SELECT * FROM students
WHERE enrollment_date BETWEEN '2024-01-01' AND '2024-12-31';
-- Use appropriate join order
-- Place most selective conditions first
SELECT s.student_name, c.title
FROM students s
JOIN enrollments e ON s.student_id = e.student_id
JOIN courses c ON e.course_id = c.course_id
WHERE s.major_code = 'CS' -- Selective condition
AND e.semester = '202401';
Index Usage Analysis:
-- Find unused indexes (PostgreSQL)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;
-- Index recommendations (SQL Server)
SELECT * FROM sys.dm_db_missing_index_details;
PL/SQL (Procedural Language/SQL) is Oracle's procedural extension to SQL.
Block Structure:
DECLARE
-- Declaration section
v_student_name students.student_name%TYPE;
v_enrollment_count INTEGER := 0;
v_current_date DATE := SYSDATE;
CURSOR c_students IS
SELECT student_id, student_name FROM students WHERE status = 'active';
BEGIN
-- Executable section
SELECT student_name INTO v_student_name
FROM students
WHERE student_id = 1001;
DBMS_OUTPUT.PUT_LINE('Student: ' || v_student_name);
FOR student_rec IN c_students LOOP
DBMS_OUTPUT.PUT_LINE('Processing: ' || student_rec.student_name);
END LOOP;
EXCEPTION
-- Exception handling section
WHEN NO_DATA_FOUND THEN
DBMS_OUTPUT.PUT_LINE('Student not found');
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('Error: ' || SQLERRM);
END;
/
Stored Procedures in PL/SQL:
CREATE OR REPLACE PROCEDURE calculate_gpa (
p_student_id IN students.student_id%TYPE,
p_gpa OUT NUMBER
) IS
v_total_points NUMBER := 0;
v_total_credits NUMBER := 0;
BEGIN
SELECT SUM(c.credits *
CASE e.grade
WHEN 'A' THEN 4.0
WHEN 'B' THEN 3.0
WHEN 'C' THEN 2.0
WHEN 'D' THEN 1.0
ELSE 0
END),
SUM(c.credits)
INTO v_total_points, v_total_credits
FROM enrollments e
JOIN courses c ON e.course_id = c.course_id
WHERE e.student_id = p_student_id
AND e.grade IS NOT NULL;
IF v_total_credits > 0 THEN
p_gpa := v_total_points / v_total_credits;
ELSE
p_gpa := 0;
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
p_gpa := 0;
WHEN OTHERS THEN
RAISE_APPLICATION_ERROR(-20001, 'Error calculating GPA: ' || SQLERRM);
END calculate_gpa;
/
Functions in PL/SQL:
CREATE OR REPLACE FUNCTION get_student_status (
p_student_id IN students.student_id%TYPE
) RETURN VARCHAR2 IS
v_status students.status%TYPE;
v_credits students.credits_completed%TYPE;
BEGIN
SELECT status, credits_completed
INTO v_status, v_credits
FROM students
WHERE student_id = p_student_id;
IF v_status = 'graduated' THEN
RETURN 'Graduated';
ELSIF v_credits >= 120 THEN
RETURN 'Eligible for Graduation';
ELSIF v_credits >= 90 THEN
RETURN 'Senior';
ELSIF v_credits >= 60 THEN
RETURN 'Junior';
ELSIF v_credits >= 30 THEN
RETURN 'Sophomore';
ELSE
RETURN 'Freshman';
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
RETURN 'Not Found';
END get_student_status;
/
T-SQL (Transact-SQL) is Microsoft SQL Server's procedural extension.
Variables and Control Flow:
DECLARE @StudentID INT = 1001
DECLARE @StudentName NVARCHAR(100)
DECLARE @GPA DECIMAL(3,2)
-- IF-ELSE
IF EXISTS (SELECT 1 FROM students WHERE student_id = @StudentID)
BEGIN
SELECT @StudentName = student_name
FROM students
WHERE student_id = @StudentID
PRINT 'Found student: ' + @StudentName
-- WHILE loop
WHILE @StudentID < 1010
BEGIN
-- Process students
SET @StudentID = @StudentID + 1
END
END
ELSE
BEGIN
PRINT 'Student not found'
END
-- CASE expression
SELECT
student_name,
CASE
WHEN credits_completed >= 90 THEN 'Senior'
WHEN credits_completed >= 60 THEN 'Junior'
WHEN credits_completed >= 30 THEN 'Sophomore'
ELSE 'Freshman'
END AS class_level
FROM students
Stored Procedures in T-SQL:
CREATE PROCEDURE sp_EnrollStudent
@StudentID INT,
@CourseID CHAR(8),
@Semester CHAR(6)
AS
BEGIN
SET NOCOUNT ON;
BEGIN TRY
BEGIN TRANSACTION
-- Check if already enrolled
IF EXISTS (
SELECT 1 FROM enrollments
WHERE student_id = @StudentID
AND course_id = @CourseID
AND semester = @Semester
)
BEGIN
RAISERROR('Student already enrolled', 16, 1)
ROLLBACK TRANSACTION
RETURN
END
-- Check prerequisites (simplified)
-- Insert enrollment
INSERT INTO enrollments (student_id, course_id, semester, enrollment_date)
VALUES (@StudentID, @CourseID, @Semester, GETDATE())
COMMIT TRANSACTION
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION
DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE()
DECLARE @ErrorSeverity INT = ERROR_SEVERITY()
DECLARE @ErrorState INT = ERROR_STATE()
RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)
END CATCH
END
GO
User-Defined Functions:
-- Scalar function (returns single value)
CREATE FUNCTION dbo.CalculateAge
(
@BirthDate DATE
)
RETURNS INT
AS
BEGIN
DECLARE @Age INT
SET @Age = DATEDIFF(YEAR, @BirthDate, GETDATE())
-- Adjust for birthday not yet occurred this year
IF DATEADD(YEAR, @Age, @BirthDate) > GETDATE()
SET @Age = @Age - 1
RETURN @Age
END
GO
-- Table-valued function
CREATE FUNCTION dbo.GetDepartmentStudents
(
@DeptCode CHAR(4)
)
RETURNS TABLE
AS
RETURN
(
SELECT student_id, student_name, credits_completed
FROM students
WHERE major_code = @DeptCode
AND status = 'active'
)
GO
-- Usage
SELECT * FROM dbo.GetDepartmentStudents('CS')
WHERE credits_completed > 60
Oracle Packages:
Packages group related procedures and functions.
CREATE OR REPLACE PACKAGE student_pkg IS
-- Public declarations
FUNCTION calculate_gpa(p_student_id IN NUMBER) RETURN NUMBER;
PROCEDURE enroll_student(
p_student_id IN NUMBER,
p_course_id IN VARCHAR2,
p_semester IN VARCHAR2
);
-- Public constant
c_min_credits CONSTANT NUMBER := 120;
-- Public exception
e_student_inactive EXCEPTION;
END student_pkg;
/
CREATE OR REPLACE PACKAGE BODY student_pkg IS
-- Private variable
v_debug_mode BOOLEAN := FALSE;
-- Private function
FUNCTION validate_student(p_student_id IN NUMBER) RETURN BOOLEAN IS
v_status students.status%TYPE;
BEGIN
SELECT status INTO v_status
FROM students
WHERE student_id = p_student_id;
RETURN v_status = 'active';
EXCEPTION
WHEN NO_DATA_FOUND THEN
RETURN FALSE;
END validate_student;
-- Public function implementation
FUNCTION calculate_gpa(p_student_id IN NUMBER) RETURN NUMBER IS
v_gpa NUMBER;
BEGIN
SELECT AVG(
CASE grade
WHEN 'A' THEN 4.0
WHEN 'B' THEN 3.0
WHEN 'C' THEN 2.0
WHEN 'D' THEN 1.0
ELSE 0
END
) INTO v_gpa
FROM enrollments
WHERE student_id = p_student_id
AND grade IS NOT NULL;
RETURN NVL(v_gpa, 0);
END calculate_gpa;
-- Public procedure implementation
PROCEDURE enroll_student(
p_student_id IN NUMBER,
p_course_id IN VARCHAR2,
p_semester IN VARCHAR2
) IS
BEGIN
IF NOT validate_student(p_student_id) THEN
RAISE e_student_inactive;
END IF;
INSERT INTO enrollments (student_id, course_id, semester, enrollment_date)
VALUES (p_student_id, p_course_id, p_semester, SYSDATE);
COMMIT;
EXCEPTION
WHEN DUP_VAL_ON_INDEX THEN
RAISE_APPLICATION_ERROR(-20002, 'Already enrolled');
WHEN e_student_inactive THEN
RAISE_APPLICATION_ERROR(-20003, 'Student not active');
END enroll_student;
END student_pkg;
/
SQL Server Error Handling:
CREATE PROCEDURE sp_SafeTransfer
@FromAccount INT,
@ToAccount INT,
@Amount DECIMAL(10,2)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE @TranName VARCHAR(20) = 'TransferTran';
BEGIN TRY
BEGIN TRANSACTION @TranName;
-- Check sufficient funds
IF NOT EXISTS (
SELECT 1 FROM accounts
WHERE account_id = @FromAccount
AND balance >= @Amount
)
BEGIN
;THROW 50001, 'Insufficient funds', 1;
END
UPDATE accounts SET balance = balance - @Amount
WHERE account_id = @FromAccount;
UPDATE accounts SET balance = balance + @Amount
WHERE account_id = @ToAccount;
COMMIT TRANSACTION @TranName;
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION @TranName;
DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
DECLARE @ErrorSeverity INT = ERROR_SEVERITY();
DECLARE @ErrorState INT = ERROR_STATE();
-- Log error
INSERT INTO error_log (error_message, error_date, procedure_name)
VALUES (@ErrorMessage, GETDATE(), 'sp_SafeTransfer');
-- Re-raise
RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState);
END CATCH
END
Oracle Exception Handling:
CREATE OR REPLACE PROCEDURE process_enrollments IS
-- User-defined exception
e_invalid_grade EXCEPTION;
PRAGMA EXCEPTION_INIT(e_invalid_grade, -20001);
-- Cursor with error handling
CURSOR c_enrollments IS
SELECT student_id, course_id, grade
FROM pending_enrollments;
BEGIN
FOR rec IN c_enrollments LOOP
BEGIN
-- Validate grade
IF rec.grade NOT IN ('A','B','C','D','F') THEN
RAISE e_invalid_grade;
END IF;
-- Insert valid enrollment
INSERT INTO enrollments (student_id, course_id, grade)
VALUES (rec.student_id, rec.course_id, rec.grade);
DELETE FROM pending_enrollments
WHERE student_id = rec.student_id
AND course_id = rec.course_id;
EXCEPTION
WHEN e_invalid_grade THEN
-- Log invalid grade and continue
INSERT INTO error_log (message, occurred_at)
VALUES ('Invalid grade for student ' || rec.student_id, SYSDATE);
WHEN DUP_VAL_ON_INDEX THEN
-- Handle duplicate
UPDATE pending_enrollments
SET error_message = 'Duplicate enrollment'
WHERE student_id = rec.student_id
AND course_id = rec.course_id;
WHEN OTHERS THEN
-- Unexpected error, log and continue
INSERT INTO error_log (message, occurred_at)
VALUES (SQLERRM, SYSDATE);
END;
END LOOP;
COMMIT;
EXCEPTION
WHEN OTHERS THEN
ROLLBACK;
RAISE_APPLICATION_ERROR(-20002, 'Fatal error: ' || SQLERRM);
END process_enrollments;
/
Explicit Cursors (Oracle):
DECLARE
CURSOR c_student_courses (p_student_id NUMBER) IS
SELECT c.title, e.grade, c.credits
FROM enrollments e
JOIN courses c ON e.course_id = c.course_id
WHERE e.student_id = p_student_id
ORDER BY e.semester DESC;
v_student_name students.student_name%TYPE;
v_course_record c_student_courses%ROWTYPE;
BEGIN
SELECT student_name INTO v_student_name
FROM students
WHERE student_id = 1001;
DBMS_OUTPUT.PUT_LINE('Courses for ' || v_student_name);
DBMS_OUTPUT.PUT_LINE('------------------------');
OPEN c_student_courses(1001);
LOOP
FETCH c_student_courses INTO v_course_record;
EXIT WHEN c_student_courses%NOTFOUND;
DBMS_OUTPUT.PUT_LINE(
v_course_record.title || ': ' ||
NVL(v_course_record.grade, 'IP') ||
' (' || v_course_record.credits || ' credits)'
);
END LOOP;
DBMS_OUTPUT.PUT_LINE('Total courses: ' || c_student_courses%ROWCOUNT);
CLOSE c_student_courses;
END;
/
Cursor FOR Loops:
DECLARE
CURSOR c_dept_summary IS
SELECT d.dept_name,
COUNT(DISTINCT s.student_id) AS student_count,
COUNT(DISTINCT f.faculty_id) AS faculty_count
FROM departments d
LEFT JOIN students s ON d.dept_code = s.major_code
LEFT JOIN faculty f ON d.dept_code = f.dept_code
GROUP BY d.dept_name;
BEGIN
FOR dept_rec IN c_dept_summary LOOP
DBMS_OUTPUT.PUT_LINE(
dept_rec.dept_name || ': ' ||
dept_rec.student_count || ' students, ' ||
dept_rec.faculty_count || ' faculty'
);
END LOOP;
END;
/
Cursors with Parameters (SQL Server):
DECLARE @StudentID INT = 1001
DECLARE @CourseID CHAR(8)
DECLARE @Grade CHAR(2)
DECLARE student_courses CURSOR FOR
SELECT c.course_id, e.grade
FROM enrollments e
JOIN courses c ON e.course_id = c.course_id
WHERE e.student_id = @StudentID
OPEN student_courses
FETCH NEXT FROM student_courses INTO @CourseID, @Grade
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT 'Course: ' + @CourseID + ', Grade: ' + ISNULL(@Grade, 'IP')
FETCH NEXT FROM student_courses INTO @CourseID, @Grade
END
CLOSE student_courses
DEALLOCATE student_courses
Dynamic SQL constructs and executes SQL statements at runtime.
Oracle EXECUTE IMMEDIATE:
CREATE OR REPLACE PROCEDURE search_table(
p_table_name VARCHAR2,
p_column_name VARCHAR2,
p_search_value VARCHAR2
) IS
v_sql VARCHAR2(1000);
v_count NUMBER;
v_cursor SYS_REFCURSOR;
BEGIN
-- Build dynamic SQL (identifiers are concatenated, not bound;
-- validate p_table_name and p_column_name to prevent SQL injection)
v_sql := 'SELECT COUNT(*) FROM ' || p_table_name ||
' WHERE ' || p_column_name || ' = :value';
-- Execute with USING clause
EXECUTE IMMEDIATE v_sql INTO v_count USING p_search_value;
DBMS_OUTPUT.PUT_LINE('Found ' || v_count || ' rows');
-- Dynamic cursor: EXECUTE IMMEDIATE alone cannot return a multi-row
-- result set, so open a ref cursor for the dynamic statement
v_sql := 'SELECT * FROM ' || p_table_name ||
' WHERE ' || p_column_name || ' = :value';
OPEN v_cursor FOR v_sql USING p_search_value;
CLOSE v_cursor;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('Error: ' || SQLERRM);
END search_table;
/
SQL Server sp_executesql:
CREATE PROCEDURE sp_DynamicPivot
@SchemaName NVARCHAR(128),
@TableName NVARCHAR(128),
@PivotColumn NVARCHAR(128),
@AggregateColumn NVARCHAR(128)
AS
BEGIN
DECLARE @Columns NVARCHAR(MAX)
DECLARE @SQL NVARCHAR(MAX)
-- Build the column list from the data values; the column name is itself
-- a parameter, so collecting the distinct values requires dynamic SQL
SET @SQL = N'SELECT @cols = STUFF(
(SELECT DISTINCT '','' + QUOTENAME(' + QUOTENAME(@PivotColumn) + N')
FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N'
FOR XML PATH('''')), 1, 1, '''')'
EXEC sp_executesql @SQL, N'@cols NVARCHAR(MAX) OUTPUT', @cols = @Columns OUTPUT
-- Build dynamic pivot query
SET @SQL = '
SELECT *
FROM (
SELECT ' + @PivotColumn + ', ' + @AggregateColumn + '
FROM ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@TableName) + '
) AS SourceTable
PIVOT (
COUNT(' + @AggregateColumn + ')
FOR ' + @PivotColumn + ' IN (' + @Columns + ')
) AS PivotTable'
EXEC sp_executesql @SQL
END
Dynamic SQL for Table Maintenance:
CREATE OR REPLACE PROCEDURE archive_old_records(
p_table_name VARCHAR2,
p_date_column VARCHAR2,
p_cutoff_date DATE,
p_archive_table VARCHAR2 DEFAULT NULL
) IS
v_archive_name VARCHAR2(100);
v_create_sql VARCHAR2(1000);
v_insert_sql VARCHAR2(1000);
v_delete_sql VARCHAR2(1000);
v_count NUMBER;
BEGIN
-- Set archive table name
v_archive_name := NVL(p_archive_table, p_table_name || '_ARCHIVE');
-- Create archive table if not exists
BEGIN
v_create_sql := 'CREATE TABLE ' || v_archive_name ||
' AS SELECT * FROM ' || p_table_name ||
' WHERE 1=0';
EXECUTE IMMEDIATE v_create_sql;
EXCEPTION
WHEN OTHERS THEN
NULL; -- Table already exists
END;
-- Archive old records
v_insert_sql := 'INSERT INTO ' || v_archive_name ||
' SELECT * FROM ' || p_table_name ||
' WHERE ' || p_date_column || ' < :cutoff';
EXECUTE IMMEDIATE v_insert_sql USING p_cutoff_date;
-- Get count of archived records
v_count := SQL%ROWCOUNT;
-- Delete archived records
v_delete_sql := 'DELETE FROM ' || p_table_name ||
' WHERE ' || p_date_column || ' < :cutoff';
EXECUTE IMMEDIATE v_delete_sql USING p_cutoff_date;
DBMS_OUTPUT.PUT_LINE('Archived ' || v_count || ' records from ' || p_table_name);
COMMIT;
EXCEPTION
WHEN OTHERS THEN
ROLLBACK;
RAISE_APPLICATION_ERROR(-20001, 'Archive failed: ' || SQLERRM);
END archive_old_records;
/
Functional dependencies (FDs) are constraints that describe relationships between attributes in a relation. They form the theoretical foundation for database normalization.
Formal Definition:
Given a relation R with attribute set U, a functional dependency X → Y (read as "X determines Y" or "X functionally determines Y") holds if for any two tuples t1 and t2 in R:
If t1[X] = t2[X] then t1[Y] = t2[Y]
In other words, whenever two tuples agree on all attributes in X, they must also agree on all attributes in Y.
Example:
In a Student relation with attributes (student_id, student_name, major, advisor), the FD: student_id → student_name, major, advisor
This means each student_id uniquely determines the student's name, major, and advisor.
Notation:
- Capital letters (X, Y, Z) represent sets of attributes
- XY denotes the union of attribute sets X and Y
- X → Y denotes functional dependency
- F denotes a set of functional dependencies
Trivial vs Non-trivial FDs:
- Trivial FD: X → Y where Y ⊆ X (always true)
- Non-trivial FD: X → Y where Y ⊈ X
- Completely non-trivial FD: X ∩ Y = ∅
Examples:
- {student_id, student_name} → student_id (trivial: the right side is a subset of the left)
- {student_id, student_name} → {student_name, major} (non-trivial, but the two sides overlap)
- student_id → student_name (completely non-trivial: the two sides share no attributes)
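The pairwise-agreement test in the formal definition can be run directly on sample data. A Python sketch (the helper name `fd_holds` and the sample rows are illustrative, not from the text):

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Check X -> Y on a concrete relation instance.

    rows: list of dicts (tuples of the relation)
    lhs, rhs: lists of attribute names
    """
    for t1, t2 in combinations(rows, 2):
        if all(t1[a] == t2[a] for a in lhs):          # tuples agree on X...
            if not all(t1[a] == t2[a] for a in rhs):  # ...but disagree on Y
                return False
    return True

# Hypothetical Student instance
students = [
    {"student_id": 101, "student_name": "Alice", "major": "CS"},
    {"student_id": 102, "student_name": "Bob",   "major": "CS"},
]

print(fd_holds(students, ["student_id"], ["student_name", "major"]))  # True
print(fd_holds(students, ["major"], ["student_name"]))  # False: CS maps to both names
```

Note that an instance can only refute an FD; the FD itself is a constraint on all legal instances, not a property derived from one.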
Armstrong's axioms are a set of inference rules used to derive all functional dependencies implied by a given set F.
Reflexivity (Subset Rule): If Y ⊆ X, then X → Y (Any set of attributes determines all its subsets)
Augmentation: If X → Y, then XZ → YZ for any Z (Adding the same attributes to both sides preserves dependency)
Transitivity: If X → Y and Y → Z, then X → Z (Dependencies chain together)
Additional Rules Derived from Armstrong's Axioms:
Union: If X → Y and X → Z, then X → YZ (Combine dependencies with same left side)
Decomposition: If X → YZ, then X → Y and X → Z (Split right side into individual attributes)
Pseudo-transitivity: If X → Y and WY → Z, then XW → Z (Extended transitivity with additional attributes)
Example Proof:
Given F = {student_id → major, major → advisor}
Prove: student_id → advisor
Proof:
- student_id → major (given)
- major → advisor (given)
- student_id → advisor (transitivity: 1,2)
Closure of Functional Dependencies:
The closure F⁺ is the set of all FDs that can be derived from F using Armstrong's axioms.
Example: Given R = (A, B, C, G, H, I) and F = {A → B, A → C, CG → H, CG → I, B → H}, some members of the closure:
- From A → B and B → H, we get A → H (transitivity)
- From A → B and A → C, we get A → BC (union)
- From CG → H and CG → I, we get CG → HI (union)
- From A → BC and augmentation, A → ABC
The closure of an attribute set X under FD set F, denoted X⁺, is the set of all attributes that are functionally determined by X.
Algorithm for Computing Attribute Closure:
closure = X
while changes to closure do
for each FD Y → Z in F do
if Y ⊆ closure then
closure = closure ∪ Z
end if
end for
end while
return closure
Example:
Given R = (A, B, C, D, E, F) and F = {A → BC, E → CF, B → E, CD → EF}, compute the closure of {A, B}:
Step 1: closure = {A, B}
Step 2: Using A → BC: A in closure, add C → closure = {A, B, C}
Step 3: Using B → E: B in closure, add E → closure = {A, B, C, E}
Step 4: Using E → CF: E in closure, add C, F → closure = {A, B, C, E, F}
Step 5: Using CD → EF: need C and D; D not in closure → no change
Step 6: No more changes
Result: {A, B}⁺ = {A, B, C, E, F}
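The worked example can be checked with a direct implementation of the closure algorithm. A Python sketch (attribute sets are written as one-letter strings; names are illustrative):

```python
def attribute_closure(attrs, fds):
    """X+ under F: repeatedly fire FDs whose left side is inside the closure."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

# F = {A -> BC, E -> CF, B -> E, CD -> EF}, as (left, right) pairs
F = [("A", "BC"), ("E", "CF"), ("B", "E"), ("CD", "EF")]
print(sorted(attribute_closure("AB", F)))  # ['A', 'B', 'C', 'E', 'F']
```

Each pass scans F once, so the loop runs at most |F| times and the whole computation is polynomial, unlike computing F⁺ itself.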
Uses of Attribute Closure:
- Testing Superkey: X is a superkey if X⁺ contains all attributes of R
- Testing FD Validity: X → Y holds if Y ⊆ X⁺
- Computing F⁺: For each subset X of attributes, compute X⁺
- Finding Candidate Keys: Minimal attribute sets whose closure includes all attributes
Finding Candidate Keys:
Given R = (A, B, C, D) and F = {A → C, B → D, C → D}
Find candidate keys:
- Try {A, B}: {A, B}⁺ = {A, B, C, D} → candidate key
- Try {A, C}: {A, C}⁺ = {A, C, D} (B missing) → not key
- Try {B, C}: {B, C}⁺ = {B, C, D} (A missing) → not key
- Try {A, B, C}: superkey but not minimal
- Try {A, D}: {A, D}⁺ = {A, D, C} (B missing) → not key
- Try {B, C, D}: {B, C, D}⁺ = {B, C, D} (A missing) → not key
Only {A, B} is a candidate key.
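The hunt above can be automated by testing closures of subsets in increasing size and keeping only minimal ones. A brute-force Python sketch (fine for small schemas; helper names are illustrative):

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    """X+ under F, as in the closure algorithm above."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def candidate_keys(R, fds):
    """All minimal attribute sets whose closure covers R."""
    R = set(R)
    keys = []
    for size in range(1, len(R) + 1):
        for combo in combinations(sorted(R), size):
            if attribute_closure(combo, fds) == R:
                # a superkey containing an already-found key is not minimal
                if not any(set(k) <= set(combo) for k in keys):
                    keys.append(combo)
    return keys

F = [("A", "C"), ("B", "D"), ("C", "D")]
print(candidate_keys("ABCD", F))  # [('A', 'B')]
```

Enumerating subsets is exponential in the worst case, which matches the theory: a relation can have exponentially many candidate keys.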
A canonical cover (or minimal cover) Fc is a minimal set of functional dependencies equivalent to F, with the following properties:
- Right side of each FD is a single attribute
- No FD in Fc is redundant (can be derived from others)
- Left side of each FD is minimal (no extraneous attributes)
Algorithm for Computing Canonical Cover:
Fc = F
// Step 1: Decompose right sides
Replace each X → YZ with X → Y and X → Z
// Step 2: Remove extraneous attributes from left sides
for each FD X → A in Fc do
for each attribute B in X do
compute (X - {B})⁺ using current Fc
if A in (X - {B})⁺ then
replace X → A with (X - {B}) → A
end if
end for
end for
// Step 3: Remove redundant FDs
for each FD f in Fc do
compute closure using (Fc - {f})
if f can be derived then
remove f from Fc
end if
end for
return Fc
Example:
Given F = {A → BC, B → C, A → B, AB → C}
Step 1: Decompose right sides
F = {A → B, A → C, B → C, A → B, AB → C}
Remove duplicates: {A → B, A → C, B → C, AB → C}
Step 2: Remove extraneous attributes
Check AB → C:
- Compute (A)⁺ using {A → B, A → C, B → C} = {A, B, C}
- Since C ∈ A⁺, B is extraneous in AB → C
- Replace AB → C with A → C; since A → C already exists, drop AB → C
Now Fc = {A → B, A → C, B → C}
Step 3: Remove redundant FDs
Check A → C: compute the closure of {A} using {A → B, B → C}: {A}⁺ = {A, B, C}
Since C is in the closure, A → C is redundant → remove it
Final canonical cover: Fc = {A → B, B → C}
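The three steps can be coded almost verbatim. A Python sketch of the algorithm (helper names are illustrative) that reproduces the example's result:

```python
def closure(attrs, fds):
    """Attribute closure, as defined earlier in the chapter."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def canonical_cover(raw_fds):
    """Minimal cover: single-attribute right sides, no extraneous
    left-side attributes, no redundant FDs."""
    # Step 1: decompose right sides and deduplicate
    fc = list(dict.fromkeys(
        (frozenset(l), frozenset(a)) for l, r in raw_fds for a in r))
    changed = True
    while changed:
        changed = False
        # Step 2: drop extraneous left-side attributes
        for i, (lhs, rhs) in enumerate(fc):
            for b in lhs:
                smaller = lhs - {b}
                if smaller and rhs <= closure(smaller, fc):
                    fc[i] = (frozenset(smaller), rhs)
                    changed = True
                    break
        fc = list(dict.fromkeys(fc))  # re-deduplicate after rewriting
        # Step 3: drop redundant FDs
        for i, (lhs, rhs) in enumerate(fc):
            if rhs <= closure(lhs, fc[:i] + fc[i + 1:]):
                fc.pop(i)
                changed = True
                break
    return {("".join(sorted(l)), "".join(sorted(r))) for l, r in fc}

F = [("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")]
print(sorted(canonical_cover(F)))  # [('A', 'B'), ('B', 'C')]
```

Steps 2 and 3 repeat until nothing changes, because removing one FD can make another attribute or FD newly removable.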
A decomposition is dependency-preserving if all FDs in the original set F can be checked on the individual decomposed relations without performing joins.
Definition:
Let R be a relation with FD set F, and let R₁, R₂, ..., Rₙ be a decomposition of R. The projection of F onto Rᵢ (denoted πRᵢ(F)) is the set of all X → Y in F⁺ such that XY ⊆ Rᵢ.
The decomposition is dependency-preserving if the union of all πRᵢ(F) logically implies all FDs in F.
Algorithm for Testing Dependency Preservation:
Compute canonical cover Fc of F
for each FD X → Y in Fc do
result = X
while changes to result do
for each decomposed relation Rᵢ do
t = (result ∩ Rᵢ)⁺ ∩ Rᵢ
result = result ∪ t
end for
end while
if Y ⊈ result then
return false // FD not preserved
end if
end for
return true // All FDs preserved
Example:
Given R = (A, B, C, D) and F = {A → B, B → C, C → D}
Decompose into R₁ = (A, B), R₂ = (B, C), R₃ = (C, D)
Check preservation:
For A → B:
- result = {A}
- Using R₁: (A ∩ R₁)⁺ = A⁺ = {A, B, C, D}
- (A⁺ ∩ R₁) = {A, B}
- result = {A, B}
- B in result → preserved
For B → C:
- result = {B}
- Using R₂: (B ∩ R₂)⁺ = B⁺ = {B, C, D}
- (B⁺ ∩ R₂) = {B, C}
- result = {B, C}
- C in result → preserved
For C → D:
- result = {C}
- Using R₃: (C ∩ R₃)⁺ = C⁺ = {C, D}
- (C⁺ ∩ R₃) = {C, D}
- result = {C, D}
- D in result → preserved
All FDs preserved → decomposition is dependency-preserving.
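The preservation test is mechanical enough to script. A Python sketch (illustrative names) that replays the preserved example and also a decomposition that loses an FD:

```python
def closure(attrs, fds):
    """Attribute closure under a set of (left, right) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserves_dependencies(fds, decomposition):
    """For each FD, grow `result` using only attributes visible inside
    one fragment at a time (the restriction algorithm)."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for Ri in decomposition:
                t = closure(result & set(Ri), fds) & set(Ri)
                if not t <= result:
                    result |= t
                    changed = True
        if not set(rhs) <= result:
            return False  # this FD cannot be checked without a join
    return True

F = [("A", "B"), ("B", "C"), ("C", "D")]
print(preserves_dependencies(F, ["AB", "BC", "CD"]))  # True
print(preserves_dependencies(F, ["AB", "AC", "AD"]))  # False: B -> C is lost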
Database anomalies are problems that can occur in poorly designed databases, leading to data inconsistency and redundancy.
Insertion Anomalies:
Occur when certain facts cannot be recorded without recording other unrelated facts.
Example table: Student_Course (student_id, student_name, course_id, course_title, instructor, grade)
Problems:
- Cannot add a new course until a student enrolls in it
- Cannot add a new student until they enroll in at least one course
- Adding a new course requires repeating course information for each enrollment
Deletion Anomalies:
Occur when deleting facts about one subject unintentionally deletes facts about another subject.
Example: If we delete the last student enrolled in a course, we lose all information about that course (title, instructor, etc.)
Update Anomalies:
Occur when updating a fact requires multiple changes, leading to inconsistency if some changes are missed.
Example: If a course title changes, we must update every row where that course appears. Failure to update all rows creates inconsistent data.
A relation is in 1NF if every attribute contains only atomic (indivisible) values.
Violations of 1NF:
- Multivalued attributes: An attribute that can have multiple values
- Composite attributes: Attributes that contain structured data
- Repeating groups: Multiple attributes storing similar data (phone1, phone2, phone3)
Converting to 1NF:
Original (not in 1NF):
Student (student_id, student_name, phones)
Data: (101, 'Alice', '555-1234, 555-5678')
Option 1: Flatten into multiple rows
Student (student_id, student_name, phone)
(101, 'Alice', '555-1234')
(101, 'Alice', '555-5678')
Option 2: Separate into two tables
Student (student_id, student_name)
Student_Phone (student_id, phone)
Example - Books and Authors:
Not in 1NF:
Book (ISBN, title, authors)
(0-123-45678-9, 'Database Design', 'Codd, Date, Ullman')
Convert to 1NF:
Book_Author (ISBN, title, author)
(0-123-45678-9, 'Database Design', 'Codd')
(0-123-45678-9, 'Database Design', 'Date')
(0-123-45678-9, 'Database Design', 'Ullman')
Or better:
Book (ISBN, title)
Author (author_id, author_name)
Book_Author (ISBN, author_id)
A relation is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the entire primary key (no partial dependencies).
Partial Dependency: When a non-key attribute depends on only part of a composite primary key.
Example Violating 2NF:
Enrollment (student_id, course_id, student_name, course_title, grade)
Primary key: (student_id, course_id)
Dependencies:
- student_id, course_id → grade (full dependency)
- student_id → student_name (partial dependency)
- course_id → course_title (partial dependency)
This violates 2NF because student_name depends on only part of the key (student_id), and course_title depends on only part (course_id).
Converting to 2NF:
Decompose into:
Student (student_id, student_name)
Course (course_id, course_title)
Enrollment (student_id, course_id, grade)
Now all non-key attributes depend on the entire primary key in each relation.
Another Example - Order Items:
Order_Item (order_id, product_id, order_date, customer_name, product_name, quantity, price)
Primary key: (order_id, product_id)
Dependencies:
- order_id, product_id → quantity, price
- order_id → order_date, customer_name
- product_id → product_name
Partial dependencies exist → violates 2NF
Solution:
Order (order_id, order_date, customer_name)
Product (product_id, product_name)
Order_Detail (order_id, product_id, quantity, price)
A relation is in 3NF if it is in 2NF and no non-key attribute is transitively dependent on the primary key.
Transitive Dependency: When a non-key attribute depends on another non-key attribute.
Example Violating 3NF:
Employee (emp_id, emp_name, dept_id, dept_name, dept_location)
Primary key: emp_id
Dependencies:
- emp_id → dept_id
- dept_id → dept_name, dept_location (transitive dependency)
dept_name and dept_location depend on dept_id, not directly on emp_id.
Converting to 3NF:
Decompose into:
Employee (emp_id, emp_name, dept_id)
Department (dept_id, dept_name, dept_location)
Another Example - Student Advising:
Student (student_id, student_name, advisor_id, advisor_name, advisor_office)
Primary key: student_id
Dependencies:
- student_id → advisor_id
- advisor_id → advisor_name, advisor_office (transitive)
Solution:
Student (student_id, student_name, advisor_id)
Advisor (advisor_id, advisor_name, advisor_office)
A relation is in BCNF if for every non-trivial functional dependency X → Y, X is a superkey.
BCNF is stricter than 3NF—every BCNF relation is in 3NF, but not vice versa.
Example Violating BCNF (but in 3NF):
Student_Advisor (student_id, major, advisor)
Assume:
- Each student has one major and one advisor
- Each advisor advises students in only one major
- A major may have multiple advisors
FDs:
1. student_id → major, advisor
2. advisor → major
Candidate key: student_id
Check BCNF:
- FD1: student_id → major, advisor: student_id is key → OK
- FD2: advisor → major: advisor is not a superkey → violates BCNF
This relation is in 3NF but not BCNF.
Converting to BCNF:
Decompose based on violating FD advisor → major:
Student (student_id, advisor)
Advisor_Major (advisor, major)
Now both relations satisfy BCNF:
- In Student, only FD is student_id → advisor (key determines non-key)
- In Advisor_Major, only FD is advisor → major (key determines non-key)
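Checking BCNF reduces to one closure test per non-trivial FD. A Python sketch (attributes abbreviated to single letters; helper names are illustrative):

```python
def closure(attrs, fds):
    """Attribute closure under a set of (left, right) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(R, fds):
    """Non-trivial FDs whose left side is not a superkey of R."""
    R = set(R)
    return [(l, r) for l, r in fds
            if not set(r) <= set(l)        # non-trivial
            and closure(l, fds) != R]      # left side is not a superkey

# Student_Advisor(student_id=S, major=M, advisor=A): S -> MA, A -> M
F = [("S", "MA"), ("A", "M")]
print(bcnf_violations("SMA", F))  # [('A', 'M')]: advisor -> major violates BCNF
# After decomposing on the violating FD, each fragment is clean:
print(bcnf_violations("SA", [("S", "A")]))  # []
print(bcnf_violations("AM", [("A", "M")]))  # []
```

An empty violation list means the relation is in BCNF; any FD returned is a candidate to decompose on, exactly as done above.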
Another BCNF Example:
Enrollment (student_id, course_id, instructor, textbook)
Assume:
- Each course has one instructor
- Each instructor uses one textbook for a given course
- Students can take any course with any instructor
FDs:
- course_id → instructor
- instructor, course_id → textbook
Candidate keys: (student_id, course_id)
Check BCNF:
- course_id → instructor: course_id is not a superkey → violates BCNF
Decompose:
Course_Instructor (course_id, instructor)
Enrollment_Textbook (student_id, course_id, textbook)
But the FD instructor, course_id → textbook is lost: instructor does not even appear in Enrollment_Textbook, so the dependency cannot be checked there. A better decomposition observes that course_id → instructor and (instructor, course_id) → textbook together imply course_id → textbook, giving:
Course (course_id, instructor, textbook)
Enrollment (student_id, course_id)
Both relations are now in BCNF.
4NF addresses multivalued dependencies (MVDs), which occur when independent facts are repeated in a relation.
Multivalued Dependency (MVD):
X →→ Y (read as "X multidetermines Y") means that given a value of X, the set of Y values is independent of other attributes.
Formally: For any two tuples t1 and t2 with t1[X] = t2[X], there exist tuples t3 and t4 such that:
- t3[X] = t1[X], t3[Y] = t1[Y], t3[Z] = t2[Z]
- t4[X] = t1[X], t4[Y] = t2[Y], t4[Z] = t1[Z] where Z = R - (X ∪ Y)
Example - Student Courses and Hobbies:
Student_Hobby_Course (student_id, hobby, course)
student_id →→ hobby
student_id →→ course
Sample data:
student_id | hobby | course
-----------|----------|--------
101 | swimming | CS101
101 | swimming | MATH201
101 | reading | CS101
101 | reading | MATH201
This relation has redundancy because hobbies and courses are independent—every combination must appear.
4NF Definition:
A relation is in 4NF if for every nontrivial MVD X →→ Y, X is a superkey.
Converting to 4NF:
Decompose based on MVDs:
Student_Hobby (student_id, hobby)
Student_Course (student_id, course)
Now each relation contains only one independent fact per student.
5NF (also called Project-Join Normal Form) deals with join dependencies (JDs). A relation is in 5NF if every join dependency in the relation is implied by the candidate keys.
Join Dependency (JD):
A join dependency JD(R₁, R₂, ..., Rₙ) means that R is equal to the natural join of its projections onto R₁, R₂, ..., Rₙ.
Example - Suppliers, Parts, Projects:
Supply (supplier_id, part_id, project_id)
Assume business rule: A supplier supplies a part for a project if and only if:
- The supplier supplies that part (for some project)
- The supplier supplies some part for that project
- Some supplier supplies that part for that project
This creates a join dependency: JD(* [supplier_id, part_id],
[supplier_id, project_id],
[part_id, project_id])
The relation can be reconstructed by joining three binary relations, so it should be decomposed.
Converting to 5NF:
Supplier_Part (supplier_id, part_id)
Supplier_Project (supplier_id, project_id)
Part_Project (part_id, project_id)
Formal Definition:
Let R be a relation schema with attribute sets X, Y, and Z = R - (X ∪ Y). The multivalued dependency X →→ Y holds if for any two tuples t1 and t2 in any instance of R with t1[X] = t2[X], there exist tuples t3 and t4 in the instance such that:
- t3[X] = t1[X]
- t3[Y] = t1[Y]
- t3[Z] = t2[Z]
- t4[X] = t1[X]
- t4[Y] = t2[Y]
- t4[Z] = t1[Z]
Properties of MVDs:
- Complementation: If X →→ Y, then X →→ (R - X - Y)
- Reflexivity: If Y ⊆ X, then X →→ Y
- Augmentation: If X →→ Y and V ⊆ W, then XW →→ YV
- Transitivity: If X →→ Y and Y →→ Z, then X →→ (Z - Y)
- Replication: Every FD is an MVD (X → Y implies X →→ Y)
Example - Employee Skills and Languages:
Employee_Skill_Lang (emp_id, skill, language)
emp_id →→ skill
emp_id →→ language
Data:
emp_id | skill | language
-------|-----------|----------
101 | Java | English
101 | Java | Spanish
101 | Python | English
101 | Python | Spanish
This MVD creates a Cartesian product of skills and languages for each employee.
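The tuple-swap condition in the formal definition is testable on a concrete instance. A Python sketch of the check (the `mvd_holds` helper is illustrative), run against the data above:

```python
def mvd_holds(rows, attrs, X, Y):
    """Check X ->> Y: for every pair of tuples agreeing on X, the
    Y-values and Z-values (Z = remaining attributes) must combine freely."""
    Z = [a for a in attrs if a not in X and a not in Y]
    idx = {a: i for i, a in enumerate(attrs)}
    proj = lambda t, A: tuple(t[idx[a]] for a in A)
    rowset = set(rows)
    for t1 in rows:
        for t2 in rows:
            if proj(t1, X) == proj(t2, X):
                # required tuple t3: X and Y values from t1, Z values from t2
                t3 = list(t1)
                for a in Z:
                    t3[idx[a]] = t2[idx[a]]
                if tuple(t3) not in rowset:
                    return False
    return True

attrs = ("emp_id", "skill", "language")
rows = [
    (101, "Java",   "English"),
    (101, "Java",   "Spanish"),
    (101, "Python", "English"),
    (101, "Python", "Spanish"),
]
print(mvd_holds(rows, attrs, ["emp_id"], ["skill"]))      # True
print(mvd_holds(rows[:3], attrs, ["emp_id"], ["skill"]))  # False: (Python, Spanish) missing
```

The failing case shows why the MVD forces the Cartesian product: dropping any one combination breaks the independence of skills and languages.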
Definition:
A join dependency JD(R₁, R₂, ..., Rₙ) holds for relation R if R is equal to the natural join of its projections onto R₁, R₂, ..., Rₙ.
Example - Agency Database:
Consider a relation representing which agents represent which clients for which products:
Representation (agent_id, client_id, product_id)
Business rule: An agent represents a client for a product if and only if:
- The agent represents the client (for some product)
- The agent sells that product (to some client)
- The client uses that product (with some agent)
This creates a 3-way join dependency.
Check if decomposition is lossless:
Representation = π[agent_id, client_id](R) ⋈ π[agent_id, product_id](R) ⋈ π[client_id, product_id](R)
If this holds, the relation should be decomposed.
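The losslessness check can be carried out by actually projecting and re-joining. A Python sketch (illustrative helpers; the sample rows are hypothetical):

```python
def project(rows, attrs, comp):
    """Projection of tuple set `rows` (over `attrs`) onto `comp`."""
    idx = [attrs.index(a) for a in comp]
    return {tuple(r[i] for i in idx) for r in rows}

def natural_join(r1, a1, r2, a2):
    """Natural join of two tuple sets with attribute lists a1, a2."""
    common = [c for c in a1 if c in a2]
    extra = [a for a in a2 if a not in a1]
    out = set()
    for t1 in r1:
        for t2 in r2:
            if all(t1[a1.index(c)] == t2[a2.index(c)] for c in common):
                out.add(t1 + tuple(t2[a2.index(a)] for a in extra))
    return out, a1 + extra

def jd_holds(rows, attrs, components):
    """True iff the relation equals the natural join of its projections."""
    joined, jattrs = project(rows, attrs, components[0]), components[0]
    for comp in components[1:]:
        joined, jattrs = natural_join(joined, jattrs,
                                      project(rows, attrs, comp), comp)
    # put joined tuples back into the original attribute order
    reordered = {tuple(t[jattrs.index(a)] for a in attrs) for t in joined}
    return reordered == set(rows)

attrs = ["agent_id", "client_id", "product_id"]
comps = [["agent_id", "client_id"],
         ["agent_id", "product_id"],
         ["client_id", "product_id"]]
rows = [("a1", "c1", "p2"), ("a1", "c2", "p1"), ("a2", "c1", "p1")]
print(jd_holds(rows, attrs, comps))  # False: the join manufactures (a1, c1, p1)
print(jd_holds(rows + [("a1", "c1", "p1")], attrs, comps))  # True
```

The `False` case is the classic connection-trap instance: joining the three binary projections produces a tuple the original relation never contained, so the JD does not hold and the decomposition would be lossy.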
MVD vs JD:
- An MVD is a binary join dependency (JD with two components)
- JDs generalize MVDs to multiple components
Denormalization is the intentional introduction of redundancy to improve query performance.
When to Denormalize:
- Performance requirements: Queries must be faster than normalized design allows
- Read-heavy workloads: OLAP systems with few updates
- Complex joins: Frequently joined tables causing performance issues
- Reporting requirements: Pre-aggregated data needed for dashboards
Denormalization Techniques:
1. Pre-joined Tables:
Store frequently joined data together.
Normalized:
Order (order_id, customer_id, order_date)
Customer (customer_id, customer_name, customer_address)
Order_Item (order_id, product_id, quantity)
Product (product_id, product_name, price)
Denormalized:
Order_Fact (order_id, customer_name, customer_address,
order_date, product_name, quantity, price)
2. Derived Columns:
Store computed values to avoid recalculation.
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
subtotal DECIMAL(10,2),
tax DECIMAL(10,2),
shipping DECIMAL(10,2),
total DECIMAL(10,2) GENERATED ALWAYS AS
(subtotal + tax + shipping) STORED
);
3. Pre-aggregated Tables:
Create summary tables for reporting.
CREATE TABLE daily_sales_summary AS
SELECT
DATE(order_date) AS sale_date,
product_id,
COUNT(*) AS order_count,
SUM(quantity) AS total_quantity,
SUM(amount) AS total_amount
FROM orders
GROUP BY DATE(order_date), product_id;
4. Vertical Partitioning:
Split tables by column groups with different access patterns.
-- Frequently accessed columns
CREATE TABLE employee_fast (
emp_id INT PRIMARY KEY,
emp_name VARCHAR(100),
department VARCHAR(50),
salary DECIMAL(10,2)
);
-- Infrequently accessed columns
CREATE TABLE employee_slow (
emp_id INT PRIMARY KEY,
emergency_contact VARCHAR(100),
health_insurance VARCHAR(50),
notes TEXT
);
5. Horizontal Partitioning (Sharding):
Split tables by row groups.
-- Partition by date range
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
6. Materialized Views:
Pre-computed query results maintained automatically.
CREATE MATERIALIZED VIEW customer_order_summary
REFRESH COMPLETE ON DEMAND
AS
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) AS order_count,
SUM(o.total) AS lifetime_value,
MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
Risks of Denormalization:
- Update anomalies: Redundant data must be kept consistent
- Increased storage: Redundancy consumes more space
- Complex maintenance: More complex update logic
- Data inconsistency: Higher risk of conflicting values
- Harder to evolve: Schema changes more complex
Denormalization Checklist:
- Have you optimized normalized design first?
- Is the performance gain worth the redundancy cost?
- Can you automate consistency maintenance?
- Have you documented denormalization decisions?
- Is there a plan to handle updates to redundant data?
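The checklist's "automate consistency maintenance" point can be illustrated with a trigger that keeps a pre-aggregated summary table in sync with its base table. A minimal sketch using Python's sqlite3 module (the table, column, and trigger names are illustrative, not from the case studies above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    product_id INTEGER,
    quantity   INTEGER,
    amount     REAL
);
CREATE TABLE sales_summary (
    product_id     INTEGER PRIMARY KEY,
    order_count    INTEGER NOT NULL,
    total_quantity INTEGER NOT NULL
);
-- The trigger maintains the denormalized summary on every insert,
-- so application code never updates it by hand.
CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    INSERT OR IGNORE INTO sales_summary VALUES (NEW.product_id, 0, 0);
    UPDATE sales_summary
    SET order_count = order_count + 1,
        total_quantity = total_quantity + NEW.quantity
    WHERE product_id = NEW.product_id;
END;
""")
conn.execute("INSERT INTO orders VALUES (1, 10, 2, 19.98)")
conn.execute("INSERT INTO orders VALUES (2, 10, 3, 29.97)")
row = conn.execute(
    "SELECT order_count, total_quantity FROM sales_summary WHERE product_id = 10"
).fetchone()
print(row)  # (2, 5)
```

Because the trigger runs in the same transaction as the insert, the summary can never drift from the base table, which is exactly the consistency guarantee the checklist asks for.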
Top-down design starts with high-level business requirements and progressively refines them into detailed database structures.
Phases of Top-Down Design:
Phase 1: Requirements Analysis
- Interview stakeholders
- Review existing systems and documents
- Identify business processes and rules
- Document data requirements
- Create use cases and data flow diagrams
Phase 2: Conceptual Design
- Identify entities and relationships
- Create ER/EER diagrams
- Define business rules and constraints
- Validate with users
- Deliverable: Conceptual data model
Phase 3: Logical Design
- Map ER model to relational schema
- Normalize relations
- Define integrity constraints
- Create views for user access
- Deliverable: Logical database schema
Phase 4: Physical Design
- Choose storage structures
- Design indexes
- Plan partitioning
- Estimate sizes
- Create DDL scripts
- Deliverable: Physical database implementation
Phase 5: Implementation and Testing
- Create database objects
- Load test data
- Test queries and transactions
- Validate performance
- Document the system
Advantages of Top-Down:
- Focuses on business needs
- Ensures comprehensive requirements coverage
- Produces well-documented design
- Easier to validate with users
Disadvantages:
- Time-consuming
- May miss existing data sources
- Can be too abstract for implementers
Bottom-up design starts with existing data sources and integrates them into a coherent database structure.
Phases of Bottom-Up Design:
Phase 1: Source Analysis
- Inventory existing data sources
- Analyze file formats and structures
- Document current data usage
- Identify data quality issues
Phase 2: View Integration
- Identify common data elements
- Resolve naming conflicts
- Resolve structural conflicts
- Resolve constraint conflicts
Phase 3: Schema Integration
- Create global schema
- Define mappings from sources
- Identify and resolve redundancies
- Normalize integrated schema
Phase 4: Reverse Engineering
- Extract existing database schemas
- Document implicit assumptions
- Identify hidden constraints
- Understand legacy design decisions
Advantages of Bottom-Up:
- Leverages existing data assets
- Faster initial results
- Practical and grounded in reality
- Good for data warehouse design
Disadvantages:
- May perpetuate bad designs
- Can miss business requirements
- Difficult to achieve integration
- May lack overall coherence
View integration combines multiple user views into a unified conceptual schema.
Steps in View Integration:
Step 1: Pre-Integration Analysis
- Review each view independently
- Document view characteristics
- Identify potential conflicts
Step 2: Conflict Resolution
Naming Conflicts:
- Synonyms: Same concept, different names
- Homonyms: Different concepts, same name
Example:
- View1: "Customer" means person who buys
- View2: "Customer" means company that buys → homonym
Structural Conflicts:
- Type conflicts: Same concept, different representations
- Domain conflicts: Different value ranges
- Constraint conflicts: Different business rules
Example:
- View1: Employee has single phone number (attribute)
- View2: Employee has multiple phones (repeating group)
Step 3: Merging Views
- Identify common concepts
- Create superclass/subclass hierarchies
- Add relationships between views
- Resolve redundancies
Step 4: Schema Restructuring
- Optimize integrated schema
- Apply normalization
- Add integrity constraints
- Validate with stakeholders
View Integration Example:
View1 (Registrar): Student(student_id, name, dob, address)
View2 (Financial Aid): Student(ssn, name, income, aid_amount)
View3 (Housing): Resident(student_id, room, building)
Integration:
- Identify student_id and ssn as different identifiers for same entity
- Create unified Student entity with both identifiers
- Add relationships to Housing
- Resolve naming: use consistent attribute names
Schema refinement improves an initial database design to meet quality goals.
Quality Criteria:
- Completeness: All required data represented
- Correctness: Accurately models real world
- Minimality: No redundant data
- Clarity: Easy to understand
- Flexibility: Accommodates change
- Performance: Supports required workloads
Refinement Process:
Step 1: Identify Redundancies
- Look for repeating data
- Check for derived data that can be computed
- Identify unnecessary duplications
Step 2: Apply Normalization
- Check for 1NF violations
- Remove partial dependencies (2NF)
- Remove transitive dependencies (3NF)
- Consider BCNF, 4NF for special cases
Step 3: Validate Against Requirements
- Can all required queries be expressed?
- Are all business rules enforced?
- Are there missing relationships?
Step 4: Consider Performance
- Identify frequent query patterns
- Consider denormalization for performance
- Plan indexing strategy
Step 5: Review Integrity Constraints
- Add missing constraints
- Ensure referential integrity
- Define business rule constraints
Case Study 1: Banking System
Requirements:
- Manage customer accounts (checking, savings)
- Process transactions (deposits, withdrawals, transfers)
- Track account balances
- Generate monthly statements
- Handle multiple account owners
- Support branch network
Conceptual Design:
Entities:
- Customer (customer_id, name, ssn, address, phone, email)
- Account (account_id, type, balance, interest_rate, open_date)
- Branch (branch_id, name, address, phone)
- Transaction (transaction_id, type, amount, date, description)
Relationships:
- Customer owns Account (M:N with joint ownership)
- Account belongs to Branch (N:1)
- Account has Transaction (1:N)
Logical Design:
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
ssn CHAR(11) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
address TEXT,
phone VARCHAR(20),
email VARCHAR(100),
created_date DATE DEFAULT CURRENT_DATE
);
CREATE TABLE branches (
branch_id INTEGER PRIMARY KEY,
branch_name VARCHAR(100) NOT NULL,
address TEXT,
phone VARCHAR(20)
);
CREATE TABLE accounts (
account_id INTEGER PRIMARY KEY,
account_type VARCHAR(20) CHECK (account_type IN ('checking', 'savings', 'money_market')),
balance DECIMAL(15,2) DEFAULT 0.00,
interest_rate DECIMAL(5,4),
open_date DATE DEFAULT CURRENT_DATE,
branch_id INTEGER REFERENCES branches(branch_id)
);
CREATE TABLE account_owners (
customer_id INTEGER REFERENCES customers(customer_id),
account_id INTEGER REFERENCES accounts(account_id),
ownership_percentage DECIMAL(5,2) DEFAULT 100.00,
PRIMARY KEY (customer_id, account_id)
);
CREATE TABLE transactions (
transaction_id INTEGER PRIMARY KEY,
account_id INTEGER REFERENCES accounts(account_id),
transaction_type VARCHAR(20) CHECK (transaction_type IN ('deposit', 'withdrawal', 'transfer', 'fee', 'interest')),
amount DECIMAL(15,2) NOT NULL,
balance_after DECIMAL(15,2) NOT NULL,
transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
description VARCHAR(200),
related_account_id INTEGER REFERENCES accounts(account_id) -- for transfers
);
Physical Design:
-- Indexes for performance
CREATE INDEX idx_transactions_account ON transactions(account_id);
CREATE INDEX idx_transactions_date ON transactions(transaction_date);
CREATE INDEX idx_account_owners_customer ON account_owners(customer_id);
-- Partition transactions by date
CREATE TABLE transactions_2023 PARTITION OF transactions
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
-- Materialized view for daily balances
CREATE MATERIALIZED VIEW daily_balances AS
SELECT DISTINCT
account_id,
DATE(transaction_date) AS balance_date,
LAST_VALUE(balance_after) OVER (
PARTITION BY account_id, DATE(transaction_date)
ORDER BY transaction_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS end_of_day_balance
FROM transactions;
Case Study 2: E-Commerce Platform
Requirements:
- Product catalog with categories
- Shopping cart functionality
- Order processing
- Customer management
- Inventory tracking
- Reviews and ratings
Conceptual Design:
Entities:
- Customer (customer_id, email, password, name, addresses)
- Product (product_id, name, description, price, stock)
- Category (category_id, name, description)
- Order (order_id, order_date, status, total)
- OrderItem (quantity, price_at_time)
- Review (rating, comment, date)
- Cart (created_date)
Relationships:
- Product belongs to Category (M:N)
- Customer places Order (1:N)
- Order contains Product (M:N through OrderItem)
- Customer writes Review for Product (at most one review per customer-product pair)
- Customer has Cart (1:1)
Logical Design:
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
email VARCHAR(200) UNIQUE NOT NULL,
password_hash CHAR(60) NOT NULL,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_login TIMESTAMP
);
CREATE TABLE addresses (
address_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
address_type VARCHAR(20) CHECK (address_type IN ('shipping', 'billing')),
street VARCHAR(200) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(50),
postal_code VARCHAR(20),
country VARCHAR(100) NOT NULL,
is_default BOOLEAN DEFAULT FALSE
);
CREATE TABLE categories (
category_id INTEGER PRIMARY KEY,
category_name VARCHAR(100) NOT NULL,
description TEXT,
parent_category_id INTEGER REFERENCES categories(category_id)
);
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(200) NOT NULL,
description TEXT,
price DECIMAL(10,2) NOT NULL CHECK (price >= 0),
stock_quantity INTEGER NOT NULL DEFAULT 0,
sku VARCHAR(50) UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP
);
CREATE TABLE product_categories (
product_id INTEGER REFERENCES products(product_id),
category_id INTEGER REFERENCES categories(category_id),
PRIMARY KEY (product_id, category_id)
);
CREATE TABLE carts (
cart_id INTEGER PRIMARY KEY,
customer_id INTEGER UNIQUE REFERENCES customers(customer_id),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP
);
CREATE TABLE cart_items (
cart_id INTEGER REFERENCES carts(cart_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
added_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (cart_id, product_id)
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(20) CHECK (status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled')),
shipping_address_id INTEGER REFERENCES addresses(address_id),
billing_address_id INTEGER REFERENCES addresses(address_id),
subtotal DECIMAL(10,2) NOT NULL,
tax DECIMAL(10,2) NOT NULL,
shipping_cost DECIMAL(10,2) NOT NULL,
total DECIMAL(10,2) NOT NULL,
payment_method VARCHAR(50),
payment_status VARCHAR(20)
);
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL,
price_at_time DECIMAL(10,2) NOT NULL, -- Snapshot of product price
PRIMARY KEY (order_id, product_id)
);
CREATE TABLE reviews (
review_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
product_id INTEGER REFERENCES products(product_id),
rating INTEGER CHECK (rating BETWEEN 1 AND 5),
comment TEXT,
review_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(customer_id, product_id) -- One review per customer per product
);
Physical Design with Performance Optimizations:
-- Full-text search for products
CREATE INDEX idx_products_search ON products USING GIN (to_tsvector('english', product_name || ' ' || description));
-- Partial index for active carts
CREATE INDEX idx_active_carts ON carts(customer_id)
WHERE updated_at > CURRENT_TIMESTAMP - INTERVAL '7 days';
-- Composite index for order history
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
-- Inventory alerts
CREATE VIEW low_stock_products AS
SELECT product_id, product_name, stock_quantity
FROM products
WHERE stock_quantity < 10;
Case Study 3: Healthcare System
Requirements:
- Patient records management
- Appointment scheduling
- Medical history tracking
- Prescription management
- Insurance billing
- HIPAA compliance
Conceptual Design:
Entities:
- Patient (patient_id, name, dob, contact, insurance)
- Provider (provider_id, name, specialty, credentials)
- Appointment (appointment_id, date_time, status, reason)
- Diagnosis (diagnosis_code, description, category)
- Medication (medication_code, name, dosage_form, manufacturer)
- Prescription (prescribed_date, dosage, instructions, refills)
- Encounter (encounter_id, date, type, notes)
- InsuranceClaim (claim_id, amount, status, date)
Relationships:
- Patient schedules Appointment with Provider
- Provider diagnoses Patient (through Encounter)
- Provider prescribes Medication to Patient
- Encounter documents Patient-Provider interaction
- InsuranceClaim covers Encounter
Logical Design (with HIPAA considerations):
-- Separate PHI (Protected Health Information) for security
CREATE TABLE patients_demographics (
patient_id INTEGER PRIMARY KEY,
mrn VARCHAR(20) UNIQUE NOT NULL, -- Medical Record Number
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
date_of_birth DATE NOT NULL,
gender CHAR(1) CHECK (gender IN ('M', 'F', 'O')),
ssn_last4 CHAR(4), -- Only last 4 digits for security
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Sensitive data encrypted at application level
CREATE TABLE patients_contact (
patient_id INTEGER PRIMARY KEY REFERENCES patients_demographics(patient_id),
address_encrypted TEXT NOT NULL, -- Encrypted
phone_encrypted VARCHAR(200) NOT NULL, -- Encrypted
email_encrypted VARCHAR(200), -- Encrypted
emergency_contact_encrypted TEXT -- Encrypted
);
CREATE TABLE patients_insurance (
insurance_id INTEGER PRIMARY KEY,
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_name VARCHAR(200) NOT NULL,
policy_number_encrypted VARCHAR(200) NOT NULL, -- Encrypted
group_number_encrypted VARCHAR(200), -- Encrypted
coverage_start_date DATE,
coverage_end_date DATE
);
CREATE TABLE providers (
provider_id INTEGER PRIMARY KEY,
npi_number CHAR(10) UNIQUE NOT NULL, -- National Provider Identifier
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
specialty VARCHAR(100) NOT NULL,
credentials VARCHAR(100),
department VARCHAR(100)
);
CREATE TABLE appointments (
appointment_id INTEGER PRIMARY KEY,
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_id INTEGER REFERENCES providers(provider_id),
scheduled_time TIMESTAMP NOT NULL,
duration_minutes INTEGER DEFAULT 30,
status VARCHAR(20) CHECK (status IN ('scheduled', 'completed', 'cancelled', 'no-show')),
reason VARCHAR(500),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Inline INDEX clauses are MySQL-specific; standard SQL creates indexes separately
CREATE INDEX idx_appointments_patient ON appointments(patient_id);
CREATE INDEX idx_appointments_provider ON appointments(provider_id);
CREATE INDEX idx_appointments_datetime ON appointments(scheduled_time);
CREATE TABLE encounters (
encounter_id INTEGER PRIMARY KEY,
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_id INTEGER REFERENCES providers(provider_id),
appointment_id INTEGER REFERENCES appointments(appointment_id),
encounter_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
encounter_type VARCHAR(50) CHECK (encounter_type IN ('office', 'telehealth', 'emergency', 'inpatient')),
chief_complaint TEXT,
assessment_notes TEXT, -- May contain PHI
treatment_plan TEXT,
followup_required BOOLEAN DEFAULT FALSE
);
CREATE TABLE diagnoses (
diagnosis_id INTEGER PRIMARY KEY,
icd10_code VARCHAR(10) NOT NULL, -- ICD-10 diagnosis code
description TEXT NOT NULL,
category VARCHAR(100)
);
CREATE TABLE encounter_diagnoses (
encounter_id INTEGER REFERENCES encounters(encounter_id),
diagnosis_id INTEGER REFERENCES diagnoses(diagnosis_id),
is_primary BOOLEAN DEFAULT FALSE,
notes TEXT,
PRIMARY KEY (encounter_id, diagnosis_id)
);
CREATE TABLE medications (
medication_id INTEGER PRIMARY KEY,
ndc_code VARCHAR(20) UNIQUE NOT NULL, -- National Drug Code
medication_name VARCHAR(200) NOT NULL,
generic_name VARCHAR(200),
dosage_form VARCHAR(50),
manufacturer VARCHAR(200),
requires_prescription BOOLEAN DEFAULT TRUE
);
CREATE TABLE prescriptions (
prescription_id INTEGER PRIMARY KEY,
encounter_id INTEGER REFERENCES encounters(encounter_id),
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_id INTEGER REFERENCES providers(provider_id),
medication_id INTEGER REFERENCES medications(medication_id),
prescribed_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
dosage_instructions TEXT NOT NULL,
quantity DECIMAL(10,2) NOT NULL,
refills_authorized INTEGER DEFAULT 0,
dispense_as_written BOOLEAN DEFAULT FALSE,
status VARCHAR(20) CHECK (status IN ('active', 'completed', 'cancelled', 'expired'))
);
-- Audit log for HIPAA compliance
CREATE TABLE access_log (
log_id INTEGER PRIMARY KEY,
user_id VARCHAR(100) NOT NULL,
access_type VARCHAR(50) CHECK (access_type IN ('view', 'create', 'update', 'delete')),
table_name VARCHAR(100) NOT NULL,
record_id INTEGER,
access_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ip_address VARCHAR(45),
reason_for_access VARCHAR(500)
);
-- Create views for common queries
CREATE VIEW patient_summary AS
SELECT
p.patient_id,
p.mrn,
CONCAT(p.last_name, ', ', p.first_name) AS patient_name,
p.date_of_birth,
TIMESTAMPDIFF(YEAR, p.date_of_birth, CURDATE()) AS age,
COUNT(DISTINCT e.encounter_id) AS total_encounters,
MAX(e.encounter_date) AS last_encounter,
COUNT(DISTINCT pr.prescription_id) AS active_prescriptions
FROM patients_demographics p
LEFT JOIN encounters e ON p.patient_id = e.patient_id
LEFT JOIN prescriptions pr ON p.patient_id = pr.patient_id AND pr.status = 'active'
GROUP BY p.patient_id;
Security Implementation:
-- Role-based access control
CREATE ROLE healthcare_provider;
CREATE ROLE billing_staff;
CREATE ROLE admin_staff;
-- Grant minimal necessary privileges
GRANT SELECT ON patients_demographics TO healthcare_provider;
GRANT SELECT, INSERT, UPDATE ON encounters TO healthcare_provider;
GRANT SELECT ON patients_insurance TO billing_staff;
GRANT SELECT, UPDATE ON insurance_claims TO billing_staff;
-- Row-level security (PostgreSQL example)
ALTER TABLE encounters ENABLE ROW LEVEL SECURITY;
CREATE POLICY patient_access_policy ON encounters
USING (patient_id = current_setting('app.current_patient_id')::INT);
-- Encrypt sensitive data
CREATE EXTENSION IF NOT EXISTS pgcrypto;
UPDATE patients_contact
SET address_encrypted = pgp_sym_encrypt('123 Main St', 'encryption-key')
WHERE patient_id = 1;
A transaction is a logical unit of work that accesses and possibly modifies the contents of a database. Transactions transform the database from one consistent state to another.
Formal Definition:
A transaction T is a sequence of read and write operations on database items that must appear as a single, indivisible unit.
Transaction Operations:
- BEGIN TRANSACTION: Marks the start of transaction execution
- READ(X): Reads database item X into program variable
- WRITE(X): Writes program variable value to database item X
- COMMIT: Signals successful end of transaction, making changes permanent
- ROLLBACK: Signals unsuccessful end, undoing any changes made
Example Transaction - Funds Transfer:
BEGIN TRANSACTION;
READ(account_A);
account_A.balance = account_A.balance - 100;
WRITE(account_A);
READ(account_B);
account_B.balance = account_B.balance + 100;
WRITE(account_B);
COMMIT;
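The same transfer can be run end-to-end with Python's sqlite3 module, whose connection context manager commits the whole unit of work on success and rolls it back on any exception (account ids and amounts are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 500), (2, 300)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst atomically; roll back everything on failure."""
    try:
        with conn:  # BEGIN ... COMMIT on success, ROLLBACK on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

print(transfer(conn, 1, 2, 100))   # True: 500/300 becomes 400/400
print(transfer(conn, 1, 2, 1000))  # False: debit is rolled back
print(list(conn.execute("SELECT balance FROM accounts ORDER BY id")))
# [(400,), (400,)]
```

The failed second transfer leaves both balances untouched, which is exactly the atomicity property: the debit that briefly happened inside the transaction never becomes visible.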
ACID properties ensure reliable transaction processing.
Atomicity:
A transaction is an all-or-nothing unit. Either all operations complete successfully, or none take effect.
If the funds transfer transaction fails after debiting account A but before crediting account B, atomicity ensures the debit is undone (rolled back).
Implementation mechanisms:
- Shadow copying: Maintain before and after images
- Transaction log: Record changes for rollback
- Write-ahead logging: Log changes before applying them
Consistency:
A transaction preserves database consistency. If the database is consistent before the transaction, it must be consistent after.
Consistency includes:
- Explicit constraints (primary keys, foreign keys, CHECK constraints)
- Implicit business rules (account balance never negative)
- Application-level consistency (sum of debits equals sum of credits)
Isolation:
Concurrent transactions appear to execute in isolation. Each transaction should see the database as if it were the only active transaction.
Isolation levels (from lowest to highest):
- Read Uncommitted (dirty reads possible)
- Read Committed (no dirty reads)
- Repeatable Read (no non-repeatable reads)
- Serializable (transactions appear sequential)
Durability:
Once a transaction commits, its effects persist even in case of system failure. Committed data survives:
- Power failures
- System crashes
- Media failures (with proper backup/recovery)
Implementation mechanisms:
- Write-ahead logging
- Checkpointing
- Replicated storage
State Transition Diagram:
              +--------+
              | Active |
              +--------+
               /      \
              v        v
     +-----------+   +--------+
     | Partially |-->| Failed |
     | Committed |   +--------+
     +-----------+       |
           |             v
           v        +---------+
     +-----------+  | Aborted |
     | Committed |  +---------+
     +-----------+
State Descriptions:
Active: Initial state; transaction is executing read/write operations.
Partially Committed: After final statement executes but before changes become permanent. Transaction has completed its work but may still be aborted.
Committed: Transaction successfully completed. All changes are permanent and visible to other transactions.
Failed: Transaction cannot proceed normally due to logical error or system issue.
Aborted: Transaction rolled back; database restored to state before transaction began. Can be restarted later.
Example Transaction States:
BEGIN TRANSACTION; -- Enter Active state
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
-- Last statement executed, enter Partially Committed
COMMIT; -- Enter Committed state
-- OR if error occurs
BEGIN TRANSACTION; -- Active
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
-- System crash occurs here -- Enter Failed state
-- After recovery -- Enter Aborted state
A schedule is an ordering of operations from multiple transactions.
Types of Schedules:
Serial Schedule: Transactions execute one after another without interleaving.
T1: READ(A), WRITE(A), READ(B), WRITE(B)
T2: READ(A), WRITE(A), READ(B), WRITE(B)
Serializable Schedule: Interleaved execution that produces same result as some serial schedule.
T1: READ(A) -- T1 reads A
T1: WRITE(A) -- T1 writes A
T2: READ(A) -- T2 reads A after T1's write
T1: READ(B) -- T1 reads B
T2: WRITE(A) -- T2 writes A
T1: WRITE(B) -- T1 writes B
T2: READ(B) -- T2 reads B after T1's write
T2: WRITE(B) -- T2 writes B
Every conflicting pair of operations orders T1 before T2, so this schedule is equivalent to the serial schedule T1, T2.
Non-serializable Schedule: Interleaved execution that yields different result than any serial schedule.
Conflicting Operations:
Two operations conflict if:
- They belong to different transactions
- They access the same data item
- At least one is a write operation
Conflict types:
- Read-Write conflict (R-W): Transaction reads item written by another
- Write-Read conflict (W-R): Transaction writes item read by another
- Write-Write conflict (W-W): Both transactions write same item
Conflict Serializability:
A schedule is conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations.
Precedence Graph Method:
- Create a node for each transaction
- Draw edge Ti → Tj if there's a conflicting operation from Ti before Tj
- Schedule is conflict serializable if graph has no cycles
Example:
Schedule S:
T1: READ(A)
T2: READ(A)
T1: WRITE(B)
T2: WRITE(A)
T1: WRITE(A)
T2: WRITE(B)
Find conflicts:
- T1 READ(A) precedes T2 WRITE(A) (R-W) → T1 → T2
- T2 READ(A) precedes T1 WRITE(A) (R-W) → T2 → T1
- T1 WRITE(B) precedes T2 WRITE(B) (W-W) → T1 → T2
- T2 WRITE(A) precedes T1 WRITE(A) (W-W) → T2 → T1
Graph has cycle T1 → T2 → T1, so the schedule is not conflict serializable.
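The precedence graph method can be automated. A small Python sketch that extracts conflict edges from the schedule above and tests the graph for a cycle:

```python
# Each operation: (transaction, action, item), in schedule order.
schedule = [
    ("T1", "R", "A"), ("T2", "R", "A"),
    ("T1", "W", "B"), ("T2", "W", "A"),
    ("T1", "W", "A"), ("T2", "W", "B"),
]

def precedence_edges(schedule):
    """Edge Ti -> Tj for each conflicting pair: different transactions,
    same item, at least one write, with Ti's operation first."""
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Depth-first search; a back edge to a node on the stack is a cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {n: 0 for n in graph}  # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in graph[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False
    return any(state[n] == 0 and dfs(n) for n in graph)

edges = precedence_edges(schedule)
print(("T1", "T2") in edges, ("T2", "T1") in edges)  # True True -> cycle
print(has_cycle(edges))  # True -> not conflict serializable
```

The cycle T1 → T2 → T1 confirms the hand analysis: no swap of non-conflicting operations can turn this schedule into a serial one.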
View Serializability:
A schedule is view serializable if:
- For each data item, initial reads match some serial schedule
- For each data item, writes that produce values read match some serial schedule
- Final writes match some serial schedule
View serializability is less restrictive than conflict serializability. Some schedules are view serializable but not conflict serializable.
Example (View Serializable but not Conflict Serializable):
T1: READ(A)
T2: WRITE(A)
T1: WRITE(A)
T3: WRITE(A)
T2 and T3 perform blind writes (writes not preceded by reads). The schedule is not conflict serializable: T1's read before T2's write gives T1 → T2, and T2's write before T1's write gives T2 → T1, a cycle. It is, however, view equivalent to the serial schedule T1, T2, T3: in both, T1 reads the initial value of A and T3 performs the final write.
Locks synchronize access to data items, ensuring serializability.
Lock Types:
Shared Lock (S): Transaction can read but not write. Multiple transactions can hold shared locks simultaneously.
Exclusive Lock (X): Transaction can read and write. Only one transaction can hold exclusive lock.
Lock Compatibility Matrix:
| Requested | Existing S | Existing X |
|---|---|---|
| S | Yes | No |
| X | No | No |
Lock Primitives:
- lock-S(A): Request shared lock on item A
- lock-X(A): Request exclusive lock on item A
- unlock(A): Release lock on item A
Example with Locking:
T1: lock-X(A) -- Get exclusive lock on A
T1: READ(A)
T1: WRITE(A)
T1: unlock(A) -- Release lock
T2: lock-S(A) -- Wait until T1 unlocks
T2: READ(A)
T2: unlock(A)
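The compatibility matrix translates directly into code. A minimal Python check (illustrative; a real lock manager also queues blocked requests and wakes them on unlock):

```python
def compatible(requested, existing):
    """True if a requested lock ('S' or 'X') can be granted given the
    list of locks already held on the item (the compatibility matrix)."""
    if not existing:
        return True                              # unlocked item: grant anything
    if requested == "S":
        return all(l == "S" for l in existing)   # S is compatible only with S
    return False                                 # X conflicts with any held lock

print(compatible("S", ["S", "S"]))  # True  - shared locks coexist
print(compatible("S", ["X"]))       # False - an exclusive lock blocks readers
print(compatible("X", ["S"]))       # False - X needs sole access
print(compatible("X", []))          # True  - the item is unlocked
```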
2PL ensures conflict serializability through a locking protocol with two phases.
Growing Phase: Transaction acquires locks but cannot release them.
Shrinking Phase: Transaction releases locks but cannot acquire new ones.
Strict 2PL: All exclusive locks held until commit/abort. Prevents cascading aborts.
Rigorous 2PL: All locks held until commit/abort. Most common implementation.
Example of 2PL:
T1: lock-X(A) -- Growing phase
T1: lock-S(B) -- Still growing
T1: READ(A)
T1: WRITE(A)
T1: READ(B)
T1: unlock(A) -- Start shrinking (violates strict 2PL)
T1: unlock(B)
Strict 2PL version:
T1: lock-X(A) -- Growing
T1: lock-S(B) -- Growing
T1: READ(A)
T1: WRITE(A)
T1: READ(B)
T1: COMMIT -- Shrinking happens after commit
-- Locks released automatically
Deadlock occurs when transactions wait indefinitely for each other's locks.
Deadlock Example:
T1: lock-X(A) -- T1 locks A
T2: lock-X(B) -- T2 locks B
T1: lock-X(B) -- T1 waits for T2's B
T2: lock-X(A) -- T2 waits for T1's A
-- Deadlock! Both waiting forever
Deadlock Detection:
Wait-for Graph:
- Nodes represent transactions
- Edge Ti → Tj if Ti waiting for lock held by Tj
- Cycle indicates deadlock
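The wait-for graph check can be sketched in Python. This is a simplified model in which each transaction waits for at most one lock at a time, using the deadlock example above:

```python
# Who currently holds each item's lock, and who is waiting for which item.
holds = {"A": "T1", "B": "T2"}
waits = [("T1", "B"), ("T2", "A")]   # (transaction, item it is waiting for)

def wait_for_edges(holds, waits):
    """Edge Ti -> Tj when Ti waits for an item whose lock Tj holds."""
    return {(t, holds[item]) for t, item in waits if item in holds}

def deadlocked(edges):
    """Follow the single outgoing edge from each node; revisiting a node
    on the current path means the wait-for graph has a cycle."""
    graph = dict(edges)  # valid because each transaction waits on one lock
    for start in graph:
        seen, node = set(), start
        while node in graph:
            if node in seen:
                return True
            seen.add(node)
            node = graph[node]
    return False

edges = wait_for_edges(holds, waits)
print(edges == {("T1", "T2"), ("T2", "T1")})  # True: mutual waiting
print(deadlocked(edges))                      # True: cycle found
```

When a cycle is found, the DBMS breaks it by aborting one victim transaction, typically the youngest or the one with the least work done.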
Detection Algorithm:
CREATE FUNCTION detect_deadlock() RETURNS boolean AS $$
DECLARE
cycle_found boolean := false;
BEGIN
-- Build wait-for graph and check for cycles
-- If cycle found, abort youngest transaction
RETURN cycle_found;
END;
$$ LANGUAGE plpgsql;
Deadlock Prevention:
Wait-Die Scheme (non-preemptive):
- An older transaction waits for a younger one that holds the lock
- A younger transaction aborts (dies) if it requests a lock held by an older one
Wound-Wait Scheme (preemptive):
- An older transaction preempts (wounds) a younger lock holder, forcing it to abort
- A younger transaction waits if it requests a lock held by an older one
Timeout-Based Prevention:
- Transaction aborts if wait exceeds timeout
- Simple but may abort unnecessarily
Example - Wound-Wait:
T1 (older) holds lock on A
T2 (younger) requests lock on A
Result: T2 aborts (wounded)
T2 (younger) holds lock on B
T1 (older) requests lock on B
Result: T2 aborts, T1 gets lock
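Both schemes reduce to a single timestamp comparison between the requester and the current lock holder (a smaller timestamp means an older transaction). A Python sketch reproducing the wound-wait example above:

```python
def wait_die(requester_ts, holder_ts):
    """Wait-Die: older requesters wait; younger requesters die (abort)."""
    return "wait" if requester_ts < holder_ts else "abort"

def wound_wait(requester_ts, holder_ts):
    """Wound-Wait: older requesters wound (abort) the holder; younger wait."""
    return "wound holder" if requester_ts < holder_ts else "wait"

# T1 is older (ts=100) than T2 (ts=200).
print(wait_die(200, 100))    # 'abort'        - younger T2 dies
print(wait_die(100, 200))    # 'wait'         - older T1 waits
print(wound_wait(100, 200))  # 'wound holder' - T1 preempts T2's lock
print(wound_wait(200, 100))  # 'wait'         - younger T2 waits
```

Because an aborted transaction restarts with its original timestamp, it eventually becomes the oldest and cannot be aborted again, so neither scheme starves a transaction.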
Each transaction gets a unique timestamp (usually derived from the system clock). The schedule must be equivalent to the timestamp order.
Basic Timestamp Ordering:
For each data item X, maintain:
- W-timestamp(X): Largest timestamp of transaction that wrote X
- R-timestamp(X): Largest timestamp of transaction that read X
Rules:
Read Operation:
If TS(T) < W-timestamp(X), reject (read arrives too late).
Otherwise allow the read and set R-timestamp(X) = max(R-timestamp(X), TS(T)).
Write Operation:
If TS(T) < R-timestamp(X) or TS(T) < W-timestamp(X), reject (write arrives too late).
Otherwise allow the write and set W-timestamp(X) = TS(T).
Example:
Initial: W-timestamp(A)=0, R-timestamp(A)=0
T1 (TS=100): READ(A) -- TS=100 > W-timestamp=0 → allowed
-- R-timestamp(A)=100
T2 (TS=200): WRITE(A) -- TS=200 > R-timestamp=100 → allowed
-- W-timestamp(A)=200
T3 (TS=150): WRITE(A) -- TS=150 < R-timestamp=100? No
-- TS=150 < W-timestamp=200? Yes → reject
Thomas Write Rule:
If TS(T) < W-timestamp(X), ignore write (it's obsolete) instead of rejecting.
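The read/write rules, including the Thomas write rule, can be simulated in a few lines of Python, reproducing the example above:

```python
class Item:
    """A data item carrying the two timestamps the protocol maintains."""
    def __init__(self):
        self.r_ts = 0   # largest timestamp of a transaction that read it
        self.w_ts = 0   # largest timestamp of a transaction that wrote it

def read(item, ts):
    if ts < item.w_ts:
        return "reject"              # a later transaction already wrote it
    item.r_ts = max(item.r_ts, ts)
    return "ok"

def write(item, ts, thomas=False):
    if ts < item.r_ts:
        return "reject"              # a later transaction already read it
    if ts < item.w_ts:
        return "ignore" if thomas else "reject"  # obsolete write
    item.w_ts = ts
    return "ok"

a = Item()
print(read(a, 100))                 # 'ok'     -> R-timestamp(A)=100
print(write(a, 200))                # 'ok'     -> W-timestamp(A)=200
print(write(a, 150))                # 'reject' -> T3 writes too late
print(write(a, 150, thomas=True))   # 'ignore' under the Thomas write rule
```

Under the Thomas write rule the obsolete write is simply skipped rather than aborting T3, because the value it would have written could never be read by anyone.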
MVCC maintains multiple versions of data items, allowing readers to see consistent snapshot without blocking writers.
How MVCC Works:
- Each write creates new version of data item
- Each version has timestamp (or transaction ID)
- Readers see snapshot of database at their start time
- Writers don't block readers; readers don't block writers
Version Management:
Each data item version contains:
- Value
- Create timestamp (transaction that created it)
- Delete timestamp (transaction that deleted it, if any)
Snapshot Isolation:
Transaction sees all versions committed before its start timestamp. No dirty reads, no non-repeatable reads, but may have write skew anomalies.
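The visibility rule behind snapshot isolation can be sketched in Python. This is a simplified model (versions carry the creating and deleting transaction ids, and a snapshot sees only versions from transactions that committed before it started; real systems also track in-progress transactions):

```python
# Each version: (value, xmin, xmax) - ids of the creating and deleting
# transactions; xmax of None means the version has not been superseded.
versions = [
    (500, 90, 100),    # created by txn 90, superseded by txn 100
    (400, 100, None),  # current version, created by txn 100
]

def visible(version, snapshot_xid, committed):
    """A version is visible if its creator committed before the snapshot
    and its deleter (if any) had not yet committed."""
    value, xmin, xmax = version
    created = xmin in committed and xmin < snapshot_xid
    deleted = xmax is not None and xmax in committed and xmax < snapshot_xid
    return created and not deleted

def read_snapshot(versions, snapshot_xid, committed):
    for v in reversed(versions):       # newest version first
        if visible(v, snapshot_xid, committed):
            return v[0]
    return None

# Txn 101 took its snapshot before txn 100 committed: it sees the old value.
print(read_snapshot(versions, 101, committed={90}))       # 500
# A snapshot taken after txn 100 committed sees the update.
print(read_snapshot(versions, 102, committed={90, 100}))  # 400
```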
Example - PostgreSQL MVCC:
-- Transaction 1 (XID=100)
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Creates new version of row with xmin=100
-- Transaction 2 (XID=101) starts after T1
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Sees old version (xmin < 101 and not deleted)
-- T1 commits
COMMIT;
-- T2 continues
SELECT balance FROM accounts WHERE id = 1;
-- Still sees old version (snapshot at T2 start)
COMMIT;
Optimistic protocols assume conflicts are rare. Transactions proceed without locking and validate before commit.
Three Phases:
Read Phase:
- Transaction reads data, makes changes to private workspace
- No locking, no waiting
Validation Phase:
- Check if transaction conflicts with others
- Ensure serializability
Write Phase:
- If validation succeeds, make changes permanent
- If validation fails, abort and restart
Validation Techniques:
Timestamp-Based Validation:
Assign each transaction a timestamp at start (S) and at validation (V).
Check: for every conflicting transaction Tj with V(Tj) < V(Ti), one of the following must hold:
- Tj finishes its write phase before Ti starts
- Tj's write set does not intersect Ti's read set
Example:
T1: READ(A), READ(B) -- Private workspace
T2: READ(B), WRITE(B) -- Private workspace
Validation:
- T1's read set = {A,B}
- T2's write set = {B}
- Conflict on B → T2 must validate first or abort
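The read-set/write-set intersection test can be sketched in Python (backward validation against already-validated transactions; a real implementation also checks the phase-ordering conditions above):

```python
def validate(committing, validated_others):
    """A committing transaction passes validation only if no
    already-validated transaction wrote an item it read."""
    for other in validated_others:
        if other["write_set"] & committing["read_set"]:
            return False   # stale read detected: abort and restart
    return True

t1 = {"read_set": {"A", "B"}, "write_set": set()}
t2 = {"read_set": {"B"}, "write_set": {"B"}}

# If T2 validated first, T1's read of B may be stale: T1 must abort.
print(validate(t1, validated_others=[t2]))  # False
# If T1 reaches validation first, nothing conflicts: T1 passes.
print(validate(t1, validated_others=[]))    # True
```

This is why optimistic concurrency performs well under read-heavy workloads: validation only fails when write sets actually overlap with concurrent read sets.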
SQL defines four isolation levels with increasing consistency.
Read Uncommitted:
- Dirty reads possible
- No locks (or very short locks)
- Highest performance, lowest consistency
Read Committed:
- No dirty reads
- Each query sees only committed data
- Non-repeatable reads possible
- Default in Oracle, PostgreSQL
Repeatable Read:
- No dirty reads, no non-repeatable reads
- Phantom reads possible
- Default in MySQL
Serializable:
- Highest isolation level
- Transactions appear serial
- No anomalies possible
Anomalies:
Dirty Read: Reading uncommitted data that may be rolled back.
Non-repeatable Read: Same query returns different values within transaction.
Phantom Read: New rows appear/disappear in subsequent queries.
Isolation Level Comparison:
| Level | Dirty Read | Non-repeatable | Phantom |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Not Possible | Possible | Possible |
| Repeatable Read | Not Possible | Not Possible | Possible |
| Serializable | Not Possible | Not Possible | Not Possible |
Setting Isolation Levels:
-- SQL Server
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
-- queries
COMMIT;
-- PostgreSQL
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- queries
COMMIT;
-- MySQL
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
-- queries
COMMIT;
Implementation Differences:
Read Committed in PostgreSQL:
- Uses MVCC
- Each statement gets new snapshot
- Ensures statement-level consistency
Serializable in PostgreSQL:
- Uses Serializable Snapshot Isolation (SSI)
- Detects serialization anomalies
- May abort transactions with conflicts
Database systems must handle various failure types.
Transaction Failures:
- Logical errors (division by zero, constraint violation)
- System errors (deadlock, resource exhaustion)
- User actions (explicit ROLLBACK)
System Crashes:
- Power failure
- Operating system crash
- DBMS software failure
- Hardware failure (CPU, memory)
Media Failures:
- Disk head crash
- Disk controller failure
- Bad sectors
- Accidental file deletion
Natural Disasters:
- Fire, flood, earthquake
- Theft, vandalism
- Power grid failure
Write-ahead logging (WAL) ensures durability and supports recovery.
Log Structure:
Each log record contains:
- LSN: Log Sequence Number (increasing)
- TransID: Transaction identifier
- PageID: Modified page identifier
- Old value: Before image (for undo)
- New value: After image (for redo)
- PrevLSN: Previous LSN for same transaction (linked list)
Log Record Types:
- UPDATE: Record of data modification
- COMMIT: Transaction committed
- ABORT: Transaction aborted
- CHECKPOINT: Checkpoint record
- CLR: Compensation Log Record (for undo)
Write-Ahead Logging Protocol:
- Log record must be written to stable storage before corresponding data page
- All log records for transaction must be written before COMMIT record
- Data pages can be written anytime after log records are stable
Example Log Sequence:
LSN 100: T1 BEGIN
LSN 101: T1 UPDATE (A, 100, 200) -- A changed from 100 to 200
LSN 102: T1 UPDATE (B, 50, 150) -- B changed from 50 to 150
LSN 103: T1 COMMIT
LSN 104: T2 BEGIN
LSN 105: T2 UPDATE (C, 75, 125)
-- Crash occurs here
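Recovery from this crash follows directly from the WAL rules: T1 has a COMMIT record, so its updates are redone; T2 does not, so its update is undone using the before image. A minimal sketch of that decision, using the example log (the assumption that C's page was flushed before the crash is illustrative):

```python
# (LSN, transaction, record type, (item, old value, new value))
log = [
    (100, "T1", "BEGIN", None),
    (101, "T1", "UPDATE", ("A", 100, 200)),
    (102, "T1", "UPDATE", ("B", 50, 150)),
    (103, "T1", "COMMIT", None),
    (104, "T2", "BEGIN", None),
    (105, "T2", "UPDATE", ("C", 75, 125)),
]  # crash occurs here

committed = {t for _, t, kind, _ in log if kind == "COMMIT"}
db = {"A": 100, "B": 50, "C": 125}   # assume C's page was flushed pre-crash

for _, t, kind, body in log:                   # redo pass (forward scan)
    if kind == "UPDATE" and t in committed:
        item, _, new = body
        db[item] = new
for _, t, kind, body in reversed(log):         # undo pass (backward scan)
    if kind == "UPDATE" and t not in committed:
        item, old, _ = body
        db[item] = old

print(db)  # {'A': 200, 'B': 150, 'C': 75}
```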
Checkpoints reduce recovery time by establishing known-good database states.
Simple Checkpoint:
- Suspend all transactions
- Write all dirty (modified) buffers to disk
- Write checkpoint record to log
- Resume transactions
Problem: Suspending transactions hurts performance.
Fuzzy Checkpoint:
- Write checkpoint record to log (includes list of active transactions)
- Continue transactions normally
- Background process writes dirty buffers
- No suspension needed
Checkpoint Record Contents:
- List of active transactions
- Their most recent LSNs
- Dirty buffer information
Recovery with Checkpoint:
During recovery, only the log written after the last checkpoint needs to be scanned, plus earlier records of any transactions that were still active at the checkpoint.
Alternative to log-based recovery using page table indirection.
How Shadow Paging Works:
1. Maintain two page tables:
   - Current page table: Points to current database pages
   - Shadow page table: Points to original pages (before transaction)
2. During transaction:
   - All changes go to new pages (not overwriting originals)
   - Current page table updated to point to new pages
   - Shadow page table unchanged
3. On commit:
   - Make shadow page table point to new pages
   - Discard original pages
4. On abort:
   - Discard new pages
   - Keep using shadow page table
Advantages:
- No logging overhead
- Fast recovery (no undo/redo)
- Simple implementation
Disadvantages:
- Data fragmentation
- Garbage collection needed
- Poor for large transactions
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the most widely used recovery method.
Key Principles:
- Write-ahead logging: Log before writing data
- Repeating history during redo: Redo all operations from log
- Logging during undo: Log undo actions (CLRs) to avoid redoing undone work
ARIES Data Structures:
- Log: Sequential log records with LSNs
- Page LSN: LSN of last log record affecting page
- Dirty Page Table (DPT): Pages modified since last checkpoint
- Transaction Table: Active transactions and their last LSN
ARIES Recovery Phases:
Phase 1 - Analysis:
- Start from last checkpoint
- Scan log forward
- Build Transaction Table and Dirty Page Table
- Determine starting point for redo
Phase 2 - Redo:
- Start from smallest LSN in Dirty Page Table
- Scan log forward
- Reapply all updates (even for aborted transactions)
- Ensures database state matches logged history
Phase 3 - Undo:
- Scan backward from end
- Undo all transactions active at crash
- Write CLRs for each undo
- Continue until all aborted transactions undone
ARIES Example:
Checkpoint at LSN 100:
- Active transactions: T1 (last LSN=95)
- Dirty pages: P1 (LSN=95)
Log after checkpoint:
LSN 101: T1 UPDATE P2 -- T1 continues
LSN 102: T2 BEGIN
LSN 103: T2 UPDATE P1
LSN 104: T1 UPDATE P3
LSN 105: T2 UPDATE P2
-- Crash occurs
Recovery:
Analysis Phase:
- Transaction Table: T1(last=104), T2(last=105)
- Dirty Page Table: P1(recLSN=95, carried over from the checkpoint), P2(recLSN=101), P3(recLSN=104)
- Redo start = 95 (smallest recLSN in the Dirty Page Table)
Redo Phase:
- Scan forward from LSN 95, reapplying LSN 95 and 101-105 (all logged updates)
Undo Phase:
- Undo the loser transactions' updates in reverse LSN order across both transactions
- Undo LSN 105 (T2), write CLR 106
- Undo LSN 104 (T1), write CLR 107
- Undo LSN 103 (T2), write CLR 108
- Undo LSN 101 (T1), write CLR 109
- Undo LSN 95 (T1, from before the checkpoint), write CLR 110
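The analysis phase of this example can be traced in a few lines. A minimal sketch (the tuple-based log format is an assumption; note that P1's recLSN comes from the checkpoint's Dirty Page Table, which is why redo starts at 95):

```python
checkpoint = {"active": {"T1": 95}, "dirty": {"P1": 95}}
log = [(101, "T1", "P2"), (102, "T2", None), (103, "T2", "P1"),
       (104, "T1", "P3"), (105, "T2", "P2")]   # (LSN, txn, page updated)

txn_table = dict(checkpoint["active"])   # txn -> last LSN
dpt = dict(checkpoint["dirty"])          # page -> recLSN (first dirtying LSN)
for lsn, txn, page in log:               # forward scan from the checkpoint
    txn_table[txn] = lsn
    if page is not None and page not in dpt:
        dpt[page] = lsn                  # keep the earliest recLSN per page

redo_start = min(dpt.values())
print(txn_table)   # {'T1': 104, 'T2': 105}
print(dpt)         # {'P1': 95, 'P2': 101, 'P3': 104}
print(redo_start)  # 95
```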
Backup Types:
Full Backup:
- Complete copy of entire database
- Time-consuming, large storage
- Baseline for recovery
Incremental Backup:
- Only changes since last backup
- Faster, smaller
- Requires full backup for restore
Differential Backup:
- All changes since last full backup
- Larger than incremental, faster restore
Log Backup:
- Transaction log files
- Enables point-in-time recovery
- Continuous or scheduled
Backup Strategies:
Strategy 1: Full + Differential
Sunday: Full backup
Monday: Differential (changes since Sunday)
Tuesday: Differential (changes since Sunday)
Wednesday: Differential
Thursday: Differential
Friday: Differential
Saturday: Differential
Restore: Full + most recent differential
Strategy 2: Full + Incremental
Sunday: Full backup
Monday: Incremental (Monday's changes)
Tuesday: Incremental (Tuesday's changes)
Wednesday: Incremental
Thursday: Incremental
Friday: Incremental
Saturday: Incremental
Restore: Full + all incrementals in order
Strategy 3: Full + Logs
Sunday 00:00: Full backup
Continuous: Transaction log backups every 15 minutes
Restore: Full + all log backups up to desired time
Point-in-Time Recovery:
-- PostgreSQL
pg_basebackup -D /backup/full
-- Then restore with recovery target time
recovery_target_time = '2024-01-15 14:30:00'
-- SQL Server
RESTORE DATABASE db FROM full_backup
WITH NORECOVERY;
RESTORE LOG db FROM log_backup
WITH RECOVERY, STOPAT = '2024-01-15 14:30:00';
Backup Commands:
-- MySQL
mysqldump --all-databases --single-transaction > full_backup.sql
mysql < full_backup.sql
-- PostgreSQL
pg_dump dbname > dbname.sql
psql dbname < dbname.sql
-- Oracle
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;
RMAN> RESTORE DATABASE;
RMAN> RECOVER DATABASE;
Restore Testing:
Critical practice: Regularly test backups by restoring to test environment.
-- Create test database
CREATE DATABASE restore_test;
-- Restore backup to test
-- Run validation queries
-- Compare with production
DROP DATABASE restore_test;
Database files organize data on disk for efficient access.
Storage Hierarchy:
- Primary Storage (Cache): CPU registers, L1/L2 cache (nanoseconds)
- Main Memory: RAM (microseconds)
- Flash Storage: SSD (microseconds to milliseconds)
- Magnetic Storage: HDD (milliseconds)
- Optical/Tape: Archive storage (seconds to minutes)
Database File Structure:
- Data files: Store actual table data
- Index files: Store index structures
- Control files: Database configuration and state
- Log files: Transaction logs for recovery
- Temp files: Sort/hash temporary storage
Pages/Blocks:
Database manages storage in fixed-size units (typically 4KB-64KB):
- Each page contains multiple records
- Page header with metadata
- Records stored within page
- Free space tracking
Heap files store records without any particular order.
Structure:
- New records appended at end
- Deleted records create gaps
- Free space management tracks available slots
Advantages:
- Fast inserts (append)
- Simple implementation
- Good for full table scans
Disadvantages:
- Slow searches (need full scan)
- Space fragmentation
- Inefficient for ordered access
Heap File Organization:
Page 0: [Header][Record1][Record2][Free Space]
Page 1: [Header][Record3][Free Space][Record4]
Page 2: [Header][Record5][Record6][Record7]
Free Space Management:
Methods to track free space:
- Linked list: Pages with free space linked together
- Bitmap: Bit per page indicating free space availability
- Space map: Separate structure tracking page free space
Sequential files maintain records in sorted order by search key.
Structure:
- Records ordered by key value
- Overflow pages for inserts
- Periodic reorganization
Advantages:
- Fast sequential access in key order
- Efficient range queries
- Binary search possible
Disadvantages:
- Slow inserts (need to maintain order)
- Overflow chains degrade performance
- Periodic reorganization needed
Maintaining Order:
Insert record with key=45 into sorted file:
Original: [10][20][30][40][50][60]
Insert 45:
- Find location (between 40 and 50)
- If space in page, insert
- If no space, create overflow page
- Link overflow to main page
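The insert logic above can be sketched with a page capacity limit and an overflow list (the capacity of 6 keys is an arbitrary assumption for illustration):

```python
import bisect

PAGE_CAPACITY = 6
page, overflow = [10, 20, 30, 40, 50, 60], []

def insert_sorted(key):
    if len(page) < PAGE_CAPACITY:
        bisect.insort(page, key)      # room available: keep the page sorted
    else:
        bisect.insort(overflow, key)  # page full: chain into an overflow page

insert_sorted(45)
print(page)      # [10, 20, 30, 40, 50, 60]  (unchanged: it was full)
print(overflow)  # [45]
```

As overflow chains grow, lookups degrade, which is why sequential files need periodic reorganization to merge overflow pages back into the main file.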
Hash files distribute records across buckets using hash function.
Static Hashing:
- Fixed number of buckets
- Hash function maps key to bucket
- Simple but doesn't adapt to growth
Extendible Hashing:
- Dynamic bucket growth
- Directory of pointers to buckets
- Splits buckets when full
- Handles growing data gracefully
Linear Hashing:
- Gradual bucket growth
- No directory needed
- Round-robin splitting
- Good for concurrent access
Hash Function Properties:
- Deterministic: Same key always same hash
- Uniform: Keys evenly distributed
- Fast: Minimal computation overhead
Example - Extendible Hashing:
Global depth = 2
Directory: [00→A, 01→B, 10→C, 11→D]
Bucket A (depth=2): keys with hash 00: [k1, k4]
Bucket B (depth=2): keys with hash 01: [k2, k5]
Bucket C (depth=2): keys with hash 10: [k3]
Bucket D (depth=2): keys with hash 11: [k6, k7, k8]
Insert k9 with hash 10:
- Bucket C overflows
- Split bucket C into C (100) and E (101)
- Global depth becomes 3
- Directory doubles
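A bucket split like the one above redistributes keys by examining one more hash bit. A simplified sketch, using low-order bits to select buckets (a common convention) and treating small integers as their own hash values; the bucket capacity of 2 is an assumption:

```python
def split_bucket(keys, local_depth):
    """Split a full bucket by examining one more low-order hash bit."""
    halves = {0: [], 1: []}
    for k in keys:
        halves[(k >> local_depth) & 1].append(k)   # next bit picks the half
    return local_depth + 1, halves[0], halves[1]

# Bucket "10" (local depth 2) holds keys 2 and 6; inserting 10 overflows it.
depth, bucket_010, bucket_110 = split_bucket([2, 6, 10], 2)
print(depth, bucket_010, bucket_110)  # 3 [2, 10] [6]
# Since local depth (3) now exceeds the old global depth (2), the directory
# doubles from 4 to 8 entries; "010" entries point to the first new bucket,
# "110" entries to the second.
```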
RAID (Redundant Array of Independent Disks) combines multiple disks for performance and reliability.
RAID Levels:
RAID 0 (Striping):
- Data striped across disks
- No redundancy
- Best performance, no fault tolerance
RAID 1 (Mirroring):
- Data duplicated on two disks
- Excellent fault tolerance
- 50% storage efficiency
RAID 5 (Striping with Parity):
- Data striped with distributed parity
- Can survive one disk failure
- Good read performance
- Write penalty (must update parity)
RAID 6 (Striping with Double Parity):
- Two parity blocks per stripe
- Survives two disk failures
- Higher write penalty
RAID 10 (Striped Mirrors):
- Combination of RAID 1 and 0
- Excellent performance and fault tolerance
- 50% storage efficiency
- Most common for high-performance databases
RAID Comparison:
| Level | Min Disks | Read | Write | Fault Tolerance | Efficiency |
|---|---|---|---|---|---|
| RAID 0 | 2 | Excellent | Excellent | None | 100% |
| RAID 1 | 2 | Good | Good | 1 disk | 50% |
| RAID 5 | 3 | Good | Fair | 1 disk | 67-94% |
| RAID 6 | 4 | Good | Poor | 2 disks | 50-88% |
| RAID 10 | 4 | Excellent | Excellent | 1 per mirror | 50% |
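RAID 5's fault tolerance rests on XOR parity: the parity block is the XOR of the data blocks in a stripe, so any single lost block is the XOR of the survivors. A minimal sketch with 4-bit toy blocks:

```python
d1, d2, d3 = 0b1010, 0b0110, 0b1100   # data blocks on three disks
parity = d1 ^ d2 ^ d3                  # parity block on a fourth disk

# Disk 2 fails: rebuild its block from the surviving disks plus parity
recovered_d2 = d1 ^ d3 ^ parity
print(recovered_d2 == d2)  # True
```

The write penalty follows from the same identity: updating one data block requires reading the old data and old parity, recomputing parity, and writing both back.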
Ordered indices maintain keys in sorted order for efficient search.
Primary Index:
- Defined on sorted file's ordering key
- One index entry per data block
- Sparse index
Clustering Index:
- Defined on non-key field with sorted data
- One index entry per distinct value or block
- Data physically ordered by index key
Secondary Index:
- Defined on any field
- One index entry per record
- Dense index
- Data not physically ordered
Index Structure:
Primary Index (sparse):
Key | Block Pointer
10 | Block 0
20 | Block 1
30 | Block 2
Data Blocks:
Block 0: [10][12][15][18]
Block 1: [20][22][25][28]
Block 2: [30][32][35][38]
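A lookup through the sparse index above binary-searches the index entries, then scans the single chosen block. A minimal sketch mirroring that figure:

```python
import bisect

index_keys = [10, 20, 30]   # first key of each data block (sparse entries)
blocks = [[10, 12, 15, 18], [20, 22, 25, 28], [30, 32, 35, 38]]

def lookup(key):
    i = bisect.bisect_right(index_keys, key) - 1   # last index entry <= key
    return key in blocks[i] if i >= 0 else False   # scan only that block

print(lookup(25))  # True  (found inside block 1)
print(lookup(26))  # False
```

Because the index holds one entry per block rather than per record, it stays small enough to cache in memory even for large files.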
B-Tree is a balanced tree structure maintaining sorted data.
Properties:
- Balanced tree (all leaves at same level)
- Each node contains multiple keys
- Nodes have pointers to children
- Keys in node are sorted
Node Structure:
- Contains between ceil(n/2) and n keys, where n is the maximum number of keys per node
- Internal nodes: keys + child pointers
- Leaf nodes: keys + data pointers (or data)
B-Tree of Order m:
- Root can have 2 to m children
- Internal nodes have ceil(m/2) to m children
- Leaves have ceil(m/2) to m keys
Example B-Tree (order 5):
[30, 50]
/ | \
/ | \
[10,20] [40,45] [60,70,80]
B+ Tree is the most common index structure in databases. It extends B-Tree with data only in leaves.
Differences from B-Tree:
- All data stored in leaves
- Internal nodes store only keys (routing values)
- Leaves linked for sequential access
- More keys per node (smaller tree)
B+ Tree Structure:
Internal Nodes (routing only):
[50, 100]
/ | \
/ | \
[25,40] [75,90] [120,150]
| | |
Leaf Nodes (data):
[10,15,20]→[25,30,35]→[40,45]→[50,55,60]→[75,80,85]→...
Advantages:
- Balanced (logarithmic search)
- Efficient range queries (linked leaves)
- High fanout (fewer levels)
- Good cache performance
Operations:
Search:
function search(key):
node = root
while node is not leaf:
find child pointer for key
node = child
search leaf for key
return data pointer or null
Insert:
function insert(key, data):
find leaf for key
if leaf has space:
insert in leaf
else:
split leaf into two
distribute keys evenly
promote middle key to parent
if parent overflows, split recursively
Delete:
function delete(key):
find and remove key from leaf
if leaf underflows:
try to borrow from sibling
if cannot borrow, merge with sibling
update parent keys
if parent underflows, propagate
Hash indexes use hash functions for direct access.
Static Hash Index:
- Fixed number of buckets
- Hash function maps key to bucket
- Buckets may have overflow chains
Extendible Hash Index:
- Dynamic bucket growth
- Directory of bucket pointers
- Splits buckets when full
Linear Hash Index:
- Gradual bucket addition
- No directory overhead
- Round-robin splitting
Hash Index Example:
Hash function: h(key) = key mod 4
Bucket 0: [4, 8, 12, 16] (overflow chain for 16)
Bucket 1: [1, 5, 9, 13]
Bucket 2: [2, 6, 10, 14]
Bucket 3: [3, 7, 11, 15]
Bitmap indexes use bit arrays for low-cardinality columns.
Structure:
- One bitmap per distinct value
- Each bit represents a row
- Bit set if row has that value
Example - Gender Column:
Rows: 1:M, 2:F, 3:M, 4:F, 5:M
Gender='M' bitmap: [1,0,1,0,1]
Gender='F' bitmap: [0,1,0,1,0]
Bitmap Operations:
- AND: Find rows with multiple conditions
- OR: Find rows with any condition
- NOT: Find rows without value
- COUNT: Fast population count
Query Example:
SELECT * FROM employees
WHERE gender = 'M' AND department = 'IT'
-- Use bitmaps:
-- Gender_M bitmap: [1,0,1,0,1]
-- Dept_IT bitmap: [1,1,0,0,1]
-- AND result: [1,0,0,0,1]
-- Rows 1 and 5 match
Advantages:
- Fast for low-cardinality columns
- Efficient storage (bit compression)
- Fast boolean operations
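The query above can be evaluated with pure bit operations. A sketch holding each bitmap as a Python integer, where bit i set means row i+1 matches:

```python
gender_m = 0b10101   # rows 1, 3, 5 have gender = 'M'
dept_it  = 0b10011   # rows 1, 2, 5 are in department = 'IT'

both = gender_m & dept_it                          # AND of the two bitmaps
matches = [i + 1 for i in range(5) if both >> i & 1]
print(matches)                # [1, 5]
print(bin(both).count("1"))   # 2 -- COUNT(*) via population count
```

This is why bitmap indexes excel at multi-predicate analytics: combining conditions costs one machine word operation per 64 rows rather than per-row evaluation.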
R-Tree indexes multidimensional data (spatial, geometric).
Structure:
- Hierarchical organization of bounding boxes
- Each node has minimum bounding rectangle (MBR)
- Leaf nodes contain actual objects
- Internal nodes contain MBRs of children
R-Tree Properties:
- Balanced tree
- Overlapping regions allowed
- Minimizes area, overlap, margin
- Supports spatial queries
Spatial Queries:
Point Query: Find all objects containing point P
Range Query: Find all objects within rectangle R
Nearest Neighbor: Find k closest objects to point P
R-Tree Example:
Root MBR: covers entire space
├── Node1 MBR: covers Europe
│ ├── Leaf: Paris (point)
│ ├── Leaf: London (point)
│ └── Leaf: Berlin (point)
└── Node2 MBR: covers Asia
├── Leaf: Tokyo (point)
├── Leaf: Beijing (point)
└── Leaf: Seoul (point)
Choosing appropriate indexes is crucial for performance.
Factors to Consider:
Query Patterns:
- Which columns in WHERE clauses?
- Which columns in JOIN conditions?
- Which columns in ORDER BY?
- Which columns in GROUP BY?
Data Characteristics:
- Cardinality (number of distinct values)
- Data distribution (skew)
- Update frequency
- Table size
Index Types by Use Case:
Primary Key/Unique Constraint:
-- Automatic unique index
CREATE TABLE users (
user_id INT PRIMARY KEY,
email VARCHAR(200) UNIQUE
);
Foreign Keys:
-- Index foreign key columns
CREATE INDEX idx_orders_customer ON orders(customer_id);
Search Conditions:
-- Single column for equality
CREATE INDEX idx_users_email ON users(email);
-- Composite for multiple conditions
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date);
Range Queries:
-- Index on range column
CREATE INDEX idx_products_price ON products(price);
-- Composite with equality first, then range
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date);
Sorting/Ordering:
-- Index supports ORDER BY without sort
CREATE INDEX idx_users_registration ON users(registration_date DESC);
Covering Indexes:
-- Index includes all needed columns
CREATE INDEX idx_users_email_name
ON users(email) INCLUDE (first_name, last_name);
Partial Indexes:
-- Index only active users
CREATE INDEX idx_active_users_email
ON users(email) WHERE status = 'active';
Index Selection Guidelines:
1. Start with obvious indexes: PK, FK, frequent query columns
2. Analyze actual queries: Use query logs to identify patterns
3. Consider selectivity: High-cardinality columns benefit most
4. Balance reads vs writes: Each index adds write overhead
5. Monitor and adjust: Use database statistics and query plans
6. Avoid over-indexing: Too many indexes hurt INSERT/UPDATE performance
7. Test with realistic data: Development data may not reflect production
Index Maintenance:
-- Rebuild fragmented indexes
ALTER INDEX idx_name REBUILD;
-- Update statistics
ANALYZE TABLE table_name;
-- Monitor index usage (PostgreSQL)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;
Parsing converts SQL text into internal representation.
Parsing Phases:
Lexical Analysis:
- Break SQL into tokens (keywords, identifiers, literals)
- Identify language elements
- Report lexical errors
Syntax Analysis:
- Check grammar rules
- Build parse tree
- Validate SQL structure
- Report syntax errors
Semantic Analysis:
- Verify objects exist
- Check column references
- Validate data types
- Resolve views and synonyms
- Check permissions
Parse Tree Example:
SELECT s.name, c.title
FROM students s
JOIN enrollments e ON s.id = e.student_id
JOIN courses c ON e.course_id = c.id
WHERE e.grade = 'A'
Parse tree representation:
Query
├── SELECT
│ ├── Column: s.name
│ └── Column: c.title
├── FROM
│ ├── Table: students AS s
│ ├── JOIN: enrollments AS e ON s.id = e.student_id
│ └── JOIN: courses AS c ON e.course_id = c.id
└── WHERE
└── Condition: e.grade = 'A'
Translation converts parse tree to relational algebra.
Translation Rules:
SELECT:
SELECT * FROM students WHERE major = 'CS'
→ σ_major='CS'(students)
JOIN:
SELECT * FROM students s, enrollments e
WHERE s.id = e.student_id
→ s ⋈_{s.id=e.student_id} e
GROUP BY:
SELECT major, COUNT(*) FROM students GROUP BY major
→ γ_major, COUNT(*)(students)
Combined:
SELECT s.name, c.title
FROM students s, enrollments e, courses c
WHERE s.id = e.student_id AND e.course_id = c.id AND e.grade = 'A'
→ π_name,title(σ_grade='A'(s ⋈ e ⋈ c))
Logical plan represents query operations without physical details.
Logical Operators:
- Scan: Read table (full or index)
- Select: Filter rows (σ)
- Project: Choose columns (π)
- Join: Combine tables (⋈)
- Aggregate: Group and aggregate (γ)
- Sort: Order rows (τ)
- Distinct: Remove duplicates (δ)
Logical Plan Example:
Query:
SELECT department, AVG(salary)
FROM employees
WHERE hire_date > '2020-01-01'
GROUP BY department
HAVING COUNT(*) > 5
Logical plan:
π_department, avg_salary
│
γ_department, AVG(salary)→avg_salary, COUNT(*)→cnt
│
σ_hire_date > '2020-01-01'
│
employees
Physical plan adds implementation details to logical plan.
Physical Operators:
- Table Scan: Sequential read of entire table
- Index Scan: Read using index
- Index Seek: Direct lookup via index
- Nested Loops Join: For each outer row, scan inner
- Hash Join: Build hash table on one input
- Merge Join: Sort both inputs then merge
- Sort: Explicit sorting (may use index)
- Aggregate: Hash or sort-based grouping
Physical Plan Example:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.state = 'CA'
Physical plan options:
Option 1 (Nested Loops):
Nested Loops Join
├── Index Seek on customers (state='CA')
└── Index Seek on orders (customer_id=@id)
Option 2 (Hash Join):
Hash Join
├── Index Seek on customers (state='CA')
└── Table Scan on orders
Option 3 (Merge Join):
Merge Join
├── Sort (customers filtered by state='CA')
└── Sort (orders)
Nested Loop Join:
for each row r in outer table:
for each row s in inner table:
if r.key = s.key:
output (r, s)
Variations:
- Simple nested loop: O(n*m)
- Index nested loop: Inner table indexed → O(n * log m)
- Block nested loop: Process in blocks → better cache usage
Sort-Merge Join:
// Sort both tables on join key
sort(outer, key)
sort(inner, key)
// Merge
i = 0, j = 0
while i < outer.length and j < inner.length:
if outer[i].key < inner[j].key:
i++
else if outer[i].key > inner[j].key:
j++
else:
// Match found
output all combinations
i++, j++
Complexity: O(n log n + m log m + n + m)
Hash Join:
// Build phase
for each row s in inner table:
hash = hash(s.key)
add s to hash table[hash]
// Probe phase
for each row r in outer table:
hash = hash(r.key)
for each s in hash table[hash]:
if r.key = s.key:
output (r, s)
Complexity: O(n + m) average case
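The hash-join pseudocode above runs directly as Python, with toy rows as (key, payload) tuples:

```python
from collections import defaultdict

def hash_join(outer, inner):
    table = defaultdict(list)
    for s in inner:                  # build phase on the (smaller) inner input
        table[s[0]].append(s)
    result = []
    for r in outer:                  # probe phase: hash each outer key
        for s in table[r[0]]:
            result.append((r, s))
    return result

orders = [(1, "o1"), (2, "o2"), (1, "o3")]
customers = [(1, "Alice"), (3, "Carol")]
print(hash_join(orders, customers))
# [((1, 'o1'), (1, 'Alice')), ((1, 'o3'), (1, 'Alice'))]
```

Building the hash table on the smaller input keeps it memory-resident, which is where the O(n + m) average cost comes from.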
Join Algorithm Selection:
| Scenario | Recommended Algorithm |
|---|---|
| One table small | Nested Loop |
| Both tables large, sorted | Merge Join |
| Both tables large, unsorted | Hash Join |
| Index available | Index Nested Loop |
| Highly skewed data | Consider hash or merge |
Cost-based optimizer estimates execution costs and chooses cheapest plan.
Cost Components:
I/O Cost:
- Sequential reads (cheaper)
- Random reads (more expensive)
- Writing intermediate results
CPU Cost:
- Tuple processing
- Predicate evaluation
- Sorting/comparisons
- Hash computations
Memory Cost:
- Buffer pool usage
- Sort/hash memory
- Temporary storage
Network Cost:
- Data transfer between nodes
- Distributed query overhead
Cost Estimation:
-- Table statistics
SELECT
reltuples AS row_count,
relpages AS page_count
FROM pg_class
WHERE relname = 'employees';
-- Column statistics
SELECT
n_distinct AS distinct_values,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'employees' AND attname = 'department';
Example Cost Calculation:
Table employees: 100,000 rows, 1,000 pages
Index on department: height=3, 50 distinct values
Query: SELECT * FROM employees WHERE department = 'IT' (10% of rows)
Full Table Scan Cost:
- Read all pages: 1,000 I/Os
- CPU: evaluate predicate on 100,000 rows
- Total: ~1,000 I/O cost units
Index Scan Cost:
- Traverse index: 3 I/Os
- Read matching rows: 10,000 rows (maybe 100 pages if clustered)
- Total: 103 I/Os + random I/O penalty
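The comparison above can be written out as arithmetic. A minimal sketch using the stated figures; the 4x penalty for random versus sequential I/O is an assumption for illustration:

```python
pages, rows = 1_000, 100_000
selectivity, index_height = 0.10, 3
random_io_penalty = 4      # assumed: a random read costs ~4x a sequential one

full_scan_cost = pages                          # 1,000 sequential page reads
matching_pages = int(pages * selectivity)       # ~100 pages if clustered
index_scan_cost = (index_height + matching_pages) * random_io_penalty

print(full_scan_cost, index_scan_cost)  # 1000 412
```

With a clustered index the index scan wins here, but if the 10,000 matching rows were scattered across most of the 1,000 pages (unclustered), the random-I/O term would exceed the full scan cost, which is why optimizers often prefer sequential scans at low selectivity.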
Heuristic optimization applies rules without detailed cost analysis.
Common Heuristics:
- Push selections down: Apply filters early
- Project early: Reduce tuple size
- Reorder joins: Smaller intermediate results first
- Combine operations: Eliminate unnecessary steps
Heuristic Transformation Example:
Original:
π_name(σ_department='IT'(employees ⋈ departments))
Apply heuristics:
- Push selection: σ_department='IT' before join
- Project early: π_name after join
Result:
π_name((σ_department='IT'(employees)) ⋈ departments)
Join Order Heuristics:
Given joins: A ⋈ B ⋈ C ⋈ D
Heuristic rules:
- Smallest relations first
- Most selective filters first
- Consider available indexes
Statistics help optimizer estimate result sizes.
Table-Level Statistics:
- Number of rows (cardinality)
- Number of pages (size)
- Average row length
- Last analyzed timestamp
Column-Level Statistics:
- Number of distinct values
- Null fraction
- Most common values
- Data distribution
Histograms:
Equi-width histogram:
Value Range: 0-100
Bucket 1: 0-20 (500 rows)
Bucket 2: 21-40 (200 rows)
Bucket 3: 41-60 (800 rows)
Bucket 4: 61-80 (400 rows)
Bucket 5: 81-100 (600 rows)
Equi-height histogram:
Each bucket has ~500 rows
Bucket 1: 0-15
Bucket 2: 16-35
Bucket 3: 36-45
Bucket 4: 46-70
Bucket 5: 71-100
Most Common Values (MCV):
Value 'IT' : occurs 2000 times (10%)
Value 'Sales' : occurs 1800 times (9%)
Value 'Eng' : occurs 1500 times (7.5%)
Other values : 73.5% spread across 47 values
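Histograms turn predicates into row estimates. A sketch of range-selectivity estimation over the equi-width histogram above, using the standard assumption that values are uniformly spread within a bucket:

```python
buckets = {(0, 20): 500, (21, 40): 200, (41, 60): 800,
           (61, 80): 400, (81, 100): 600}   # (lo, hi) -> row count

def est_rows_between(lo, hi):
    """Estimate rows with lo <= value <= hi (uniform within each bucket)."""
    rows = 0.0
    for (b_lo, b_hi), count in buckets.items():
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo) + 1)
        rows += count * overlap / (b_hi - b_lo + 1)
    return rows

print(est_rows_between(41, 60))  # 800.0: the whole third bucket
print(est_rows_between(41, 50))  # 400.0: half of it, by the uniform assumption
```

For the most common values, the optimizer bypasses the histogram entirely and uses the stored MCV frequencies, which is far more accurate for skewed columns.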
Understanding execution plans is crucial for query tuning.
Reading Execution Plans:
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
-- Output:
Index Scan using idx_orders_customer on orders
(cost=0.29..8.31 rows=1 width=32)
(actual time=0.023..0.025 rows=1 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=4
Planning Time: 0.123 ms
Execution Time: 0.045 ms
Plan Components:
- Operation: What's being done (Index Scan, Hash Join)
- Cost: Estimated cost (startup..total)
- Rows: Estimated/actual row count
- Width: Average row size in bytes
- Actual time: Actual execution time
- Loops: How many times operation executed
- Buffers: Buffer cache usage
Common Plan Issues:
Sequential Scan on Large Table:
Seq Scan on orders (cost=0.00..8341.00 rows=100000 width=40)
Filter: (total > 1000)
Solution: Add index on total
Nested Loops on Large Tables:
Nested Loop (cost=1000.00..50000.00 rows=5000)
-> Seq Scan on large_table1
-> Index Scan on large_table2
Solution: Consider hash or merge join
Sort Without Index:
Sort (cost=5000.00..5125.00 rows=50000)
Sort Key: order_date
-> Seq Scan on orders
Solution: Index on order_date
Distributed queries add complexity of data location and network costs.
Challenges:
- Data fragmentation across nodes
- Network transmission costs
- Local processing costs vary
- Site autonomy constraints
Optimization Techniques:
Query Shipping: Move query to data
Execute query fragment at each site
Combine results centrally
Data Shipping: Move data to query
Transfer needed tables to central site
Execute query locally
Hybrid Approaches:
- Ship only relevant fragments
- Semijoin reduction
- Bloom filters
Distributed Query Plan Example:
Query: SELECT * FROM customers c JOIN orders o ON c.id = o.customer_id
Data: customers in NY, orders in London
Options:
1. Ship customers to London:
   - Transfer customers (100MB) to London
   - Execute join in London
   - Return results (10MB)
2. Ship orders to NY:
   - Transfer orders (1GB) to NY
   - Execute join in NY
   - Return results (10MB)
3. Semijoin reduction:
   - NY sends customer IDs to London
   - London filters orders by those IDs
   - London sends filtered orders (100MB) to NY
   - NY performs join
Cost estimation includes:
- Data transfer volume
- Network bandwidth
- Local processing costs
- Result size
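Comparing the three options above by network volume alone is a simple minimum. A sketch using the sizes from the example; the 1 MB for the shipped customer-ID list in the semijoin option is an assumption:

```python
# Bytes moved over the network, in MB, per strategy
strategies = {
    "ship_customers": 100 + 10,   # customers to London, results back
    "ship_orders":    1024 + 10,  # 1GB of orders to NY, results returned
    "semijoin":       1 + 100,    # assumed ~1MB of IDs out, filtered orders back
}

best = min(strategies, key=strategies.get)
print(best, strategies[best])  # semijoin 101
```

A full cost model would also weight these volumes by link bandwidth and add each site's local processing cost, but the semijoin's advantage here (shipping only the rows that can possibly join) is typical.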
A distributed database is a single logical database physically distributed across multiple computers.
Characteristics:
- Data stored at multiple sites
- Sites connected by network
- Each site has local autonomy
- Distributed Transaction Management
- Location transparency
Advantages:
- Reliability and availability
- Scalability
- Data locality
- Economics (cheaper than centralized mainframe)
Disadvantages:
- Complexity
- Cost of network communication
- Security challenges
- Integrity control harder
Transparency Levels:
Fragmentation Transparency: User unaware of data fragmentation
Location Transparency: User unaware of data location
Replication Transparency: User unaware of data replication
Failure Transparency: System handles failures transparently
Fragmentation divides relations into smaller pieces stored at different sites.
Horizontal Fragmentation:
Split relation by rows based on predicates.
Customers (customer_id, name, country, credit_limit)
Fragment 1 (Europe): country IN ('UK', 'FR', 'DE')
Fragment 2 (Americas): country IN ('US', 'CA', 'BR')
Fragment 3 (Asia): country IN ('JP', 'CN', 'IN')
Vertical Fragmentation:
Split relation by columns.
Employee (emp_id, name, salary, dept_id, health_info)
Fragment 1 (Public): emp_id, name, dept_id
Fragment 2 (HR only): emp_id, salary
Fragment 3 (Medical): emp_id, health_info
Mixed Fragmentation:
Combine horizontal and vertical.
First vertical: Employee_Public (emp_id, name, dept_id)
Then horizontal: Dept10_Employees, Dept20_Employees
Correctness Rules:
- Completeness: All data appears somewhere
- Reconstruction: Can reconstruct original relation
- Disjointness: Fragments don't overlap unnecessarily
Replication maintains copies of data at multiple sites.
Full Replication:
- Complete copy at every site
- Highest availability
- Update overhead
Partial Replication:
- Selected fragments replicated
- Balance availability and overhead
No Replication:
- Each fragment at one site
- Simplest, least overhead
- Single point of failure
Replication Timing:
Synchronous Replication:
- All copies updated in same transaction
- Strong consistency
- Higher latency
Asynchronous Replication:
- Updates propagate after commit
- Better performance
- Temporary inconsistency
Master-Slave Replication:
Master Site: Accepts all writes
↓ asynchronous
Slave 1: Read-only copy
Slave 2: Read-only copy
Slave 3: Read-only copy
Multi-Master Replication:
Site A: Accepts writes
↓ bidirectional sync
Site B: Accepts writes
↓ bidirectional sync
Site C: Accepts writes
Distributed transactions span multiple sites while maintaining ACID properties.
Two-Phase Commit (2PC) Protocol:
Phase 1 - Prepare:
Coordinator to all participants:
"Prepare to commit transaction T"
Each participant:
- Check if can commit
- Write prepare record to log
- Reply YES or NO
Phase 2 - Commit/Abort:
If all replied YES:
Coordinator writes commit record
Broadcast COMMIT to all participants
Else:
Coordinator writes abort record
Broadcast ABORT to all participants
Each participant:
- Complete transaction
- Write commit/abort record
- Acknowledge
2PC States:
Initial → Prepare Sent → Waiting →
if all YES → Commit Sent → Committed
if any NO → Abort Sent → Aborted
Problems with 2PC:
- Blocking: Coordinator failure blocks participants
- Performance: Multiple network round trips
- Scalability: Coordinator becomes bottleneck
Detailed 2PC Protocol:
Step 1: Application sends COMMIT to coordinator
Step 2: Coordinator sends PREPARE to all participants
Step 3: Each participant:
- Force-write prepare record to log
- Send VOTE_COMMIT or VOTE_ABORT
Step 4: Coordinator collects votes
- If all VOTE_COMMIT: write COMMIT record, send COMMIT
- If any VOTE_ABORT: write ABORT record, send ABORT
Step 5: Participants:
- Receive decision
- Force-write decision to log
- Complete transaction
- Send ACK
Step 6: Coordinator receives all ACKs
- Write end record (optional)
- Forget transaction
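The decision rule at the heart of the protocol is: commit only if every participant votes yes. A minimal sketch of the coordinator logic (class and method names are illustrative; a real participant would force-write log records at each step):

```python
def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]      # phase 1: PREPARE
    decision = "COMMIT" if all(votes) else "ABORT"   # unanimity required
    for p in participants:                           # phase 2: broadcast
        p.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit, self.state = can_commit, None
    def prepare(self):
        # real systems force-write a prepare record before voting YES
        return self.can_commit
    def finish(self, decision):
        self.state = decision   # force-write decision, then acknowledge

ok, bad = Participant(True), Participant(False)
print(two_phase_commit([ok, bad]))  # ABORT: one NO vote aborts everyone
print(ok.state)                     # ABORT
```

Note that a participant that has voted YES is blocked: it can neither commit nor abort until it hears the coordinator's decision, which is exactly the blocking weakness discussed below.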
Failure Handling:
Participant failure during prepare:
- Coordinator times out → abort
Coordinator failure during commit:
- Participants query each other
- Use consensus to determine outcome
Network partition:
- Timeout-based decisions
- May need human intervention
3PC adds an extra phase to reduce blocking problems.
Phase 1 - CanCommit: Similar to 2PC prepare, but no final decision yet.
Phase 2 - PreCommit: If all ready, coordinator sends pre-commit. Participants acknowledge and prepare to commit.
Phase 3 - DoCommit: Coordinator sends final commit.
Advantages:
- Non-blocking if coordinator fails during commit
- Participants can timeout and decide
Disadvantages:
- More messages
- Still vulnerable to network partitions
- Complex implementation
Scalability Challenges:
- Vertical scaling only (bigger servers)
- Difficult to distribute across many nodes
- Connection pooling limits
Schema Rigidity:
- Fixed schema requires migrations
- Semi-structured data difficult
- Evolving requirements hard
Performance:
- JOIN overhead at scale
- ACID overhead for simple operations
- Not optimized for specific access patterns
Data Model Mismatch:
- Object-relational impedance mismatch
- Complex mapping for nested data
- Aggregation-heavy queries slow
CAP theorem states distributed systems can only guarantee two of three properties:
Consistency (C): All nodes see same data simultaneously
Availability (A): Every request receives response (success/failure)
Partition Tolerance (P): System continues despite network partitions
CAP Combinations:
| System Type | Properties | Examples |
|---|---|---|
| CP | Consistency + Partition | HBase, MongoDB (default) |
| AP | Availability + Partition | Cassandra, CouchDB |
| CA | Consistency + Availability | Traditional RDBMS (single node) |
PACELC Extension:
- If Partition (P): trade C vs A
- Else (E): trade Latency (L) vs Consistency (C)
Characteristics:
- Simple data model: key → value
- Values are opaque (blobs)
- High performance
- Horizontal scaling
Operations:
- GET(key) → value
- PUT(key, value)
- DELETE(key)
- Batch operations
Examples:
Redis:
SET user:1001 "{name:'Alice',email:'alice@example.com'}"
GET user:1001
HSET user:1001 name "Alice"
HGET user:1001 name
Amazon DynamoDB:
# Python with boto3
table.put_item(
Item={
'user_id': '1001',
'name': 'Alice',
'email': 'alice@example.com'
}
)
response = table.get_item(Key={'user_id': '1001'})
Use Cases:
- Session storage
- User profiles
- Shopping carts
- Real-time leaderboards
- Caching layer
Characteristics:
- Store semi-structured documents (JSON, BSON, XML)
- Schema-flexible
- Nested data structures
- Document-level operations
MongoDB Example:
// Insert document
db.users.insertOne({
name: "Alice Johnson",
email: "alice@example.com",
address: {
street: "123 Main St",
city: "Boston",
zip: "02101"
},
orders: [
{ product: "laptop", price: 1200 },
{ product: "mouse", price: 25 }
]
})
// Query
db.users.find({ "address.city": "Boston" })
// Update
db.users.updateOne(
{ email: "alice@example.com" },
{ $push: { orders: { product: "keyboard", price: 75 } } }
)
// Index
db.users.createIndex({ "address.city": 1 })
Query Features:
- Field selection
- Range queries
- Regular expressions
- Geospatial queries
- Aggregation pipeline
Aggregation Pipeline:
db.orders.aggregate([
{ $match: { status: "completed" } },
{ $group: { _id: "$customer_id", total: { $sum: "$amount" } } },
{ $sort: { total: -1 } },
{ $limit: 10 }
])
Characteristics:
- Store data in column families
- Each row can have different columns
- Optimized for wide tables
- Good for time-series data
Cassandra Data Model:
-- Keyspace (database)
CREATE KEYSPACE store
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- Column family (table)
CREATE TABLE users (
user_id UUID PRIMARY KEY,
name TEXT,
email TEXT,
created_at TIMESTAMP
);
-- Wide row design for time series
CREATE TABLE sensor_data (
sensor_id TEXT,
timestamp TIMESTAMP,
temperature FLOAT,
humidity FLOAT,
pressure FLOAT,
PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- Queries
INSERT INTO users (user_id, name, email)
VALUES (uuid(), 'Alice', 'alice@example.com');
SELECT * FROM sensor_data
WHERE sensor_id = 'sensor_01'
AND timestamp >= '2024-01-01';
Cassandra Features:
- Tunable consistency
- Linear scalability
- No single point of failure
- Optimized for writes
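Tunable consistency can be illustrated with a toy quorum model. This is a sketch under simplifying assumptions (replicas as an in-memory list, versions as integers, no failures); it shows why choosing read and write quorums with R + W > N guarantees a read always sees the latest write.

```python
import random

def write(replicas, w, version, value):
    """Write to w of the n replicas; the others keep their stale copy."""
    for i in random.sample(range(len(replicas)), w):
        replicas[i] = (version, value)

def read(replicas, r):
    """Read r replicas and return the freshest copy seen."""
    return max(random.sample(replicas, r))   # tuples compare by version first
```

With N=3, W=2, R=2 every read quorum overlaps the write quorum, so at least one sampled replica holds the latest version; with R=1 a stale read is possible.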
Characteristics:
- Nodes (entities) and edges (relationships)
- Property graphs (attributes on nodes/edges)
- Optimized for relationship traversal
- Pattern matching queries
Neo4j Example:
// Create nodes
CREATE (alice:Person {name: 'Alice', age: 30})
CREATE (bob:Person {name: 'Bob', age: 32})
CREATE (product:Product {name: 'Laptop', price: 1200})
// Create relationships
CREATE (alice)-[:KNOWS]->(bob)
CREATE (alice)-[:BOUGHT {date: '2024-01-15'}]->(product)
// Query friends of friends
MATCH (alice:Person {name: 'Alice'})-[:KNOWS*2]->(friend)
RETURN DISTINCT friend.name
// Find purchase recommendations
MATCH (user:Person {name: 'Alice'})-[:BOUGHT]->(product)<-[:BOUGHT]-(other)-[:BOUGHT]->(rec)
WHERE NOT (user)-[:BOUGHT]->(rec)
RETURN rec.name, COUNT(*) AS frequency
ORDER BY frequency DESC
LIMIT 5
// Shortest path
MATCH p = shortestPath(
(alice:Person {name: 'Alice'})-[:KNOWS*]-(charlie:Person {name: 'Charlie'})
)
RETURN p
Use Cases:
- Social networks
- Recommendation engines
- Fraud detection
- Network/IT operations
- Knowledge graphs
| Aspect | SQL | NoSQL |
|---|---|---|
| Data Model | Tables, rows | Various (doc, graph, key-value) |
| Schema | Fixed, rigid | Flexible, dynamic |
| ACID | Full ACID | Varies (eventual consistency common) |
| Joins | Supported | Limited or none |
| Scaling | Vertical | Horizontal |
| Query Language | SQL | API or specific language |
| Transactions | Multi-row ACID | Often single-document only |
| Use Cases | Complex queries, strict consistency | High volume, flexible schema |
NewSQL combines SQL familiarity with NoSQL scalability.
Characteristics:
- SQL interface
- ACID transactions across nodes
- Horizontal scaling
- Distributed query execution
Key Features:
- Automatic sharding
- Distributed query optimization
- Consensus-based replication
- Global consistency
Sharding Strategies:
Range-Based Sharding:
Shard 1: IDs 1-10000
Shard 2: IDs 10001-20000
Shard 3: IDs 20001-30000
Hash-Based Sharding:
shard = hash(key) % N
Directory-Based Sharding: Maintain lookup table for key→shard mapping.
Consistent Hashing: Minimal reshuffling when nodes added/removed.
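The last two strategies can be sketched in a few lines of Python. This is an illustrative toy (the `ConsistentHashRing` class, the vnode count, and the MD5 placement hash are choices made here, not a standard API); it demonstrates the key property that removing a node only remaps the keys that lived on it.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so key placement survives process restarts
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative)."""
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []              # sorted list of (hash, node)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        # Each physical node owns many points on the ring for balance
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        # Walk clockwise to the first vnode at or after the key's hash
        h = _hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))
        if idx == len(self._ring):
            idx = 0                  # wrap around the ring
        return self._ring[idx][1]
```

Contrast with `hash(key) % N`: there, changing N remaps almost every key; here, only the departed node's keys move.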
CockroachDB:
-- CockroachDB extends PostgreSQL syntax
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name STRING,
email STRING UNIQUE,
created_at TIMESTAMP DEFAULT now()
);
-- Automatic distribution
SHOW RANGES FROM TABLE users;
-- Distributed transactions
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'abc';
UPDATE accounts SET balance = balance + 100 WHERE id = 'xyz';
COMMIT; -- Atomic across nodes
-- Geo-partitioning for data locality
ALTER TABLE european_customers
CONFIGURE ZONE USING constraints = '[+region=europe]';
Google Spanner:
Features:
- Global distribution
- TrueTime API for external consistency
- SQL interface
- Automatic replication
- Strong consistency
-- Spanner SQL
CREATE TABLE Users (
UserId INT64 NOT NULL,
Name STRING(MAX),
Email STRING(MAX)
) PRIMARY KEY (UserId);
-- Interleave tables for locality
CREATE TABLE Posts (
UserId INT64 NOT NULL,
PostId INT64 NOT NULL,
Content STRING(MAX)
) PRIMARY KEY (UserId, PostId),
INTERLEAVE IN PARENT Users ON DELETE CASCADE;
YugabyteDB:
- PostgreSQL-compatible
- Cassandra-inspired storage
- Distributed ACID transactions
- Multi-region deployment
OLTP (Online Transaction Processing):
- Many small transactions
- Current data
- Normalized schema
- Row-oriented storage
- Fast queries/writes
OLAP (Online Analytical Processing):
- Few complex queries
- Historical data
- Denormalized schema
- Column-oriented storage
- Aggregations, summaries
Comparison:
| Aspect | OLTP | OLAP |
|---|---|---|
| Users | Clerks, customers | Analysts, executives |
| Operations | Read/update | Read mostly |
| Query type | Simple, repetitive | Complex, ad-hoc |
| Data volume | GB | TB to PB |
| Response time | Milliseconds | Seconds to minutes |
| Design | Application-oriented | Subject-oriented |
Star schema is a denormalized data warehouse design.
Components:
Fact Table:
- Measures (numeric, additive)
- Foreign keys to dimensions
- Large (millions of rows)
Dimension Tables:
- Descriptive attributes
- Denormalized
- Smaller than fact
Star Schema Example:
Fact_Sales
├── date_key (FK to Dim_Date)
├── product_key (FK to Dim_Product)
├── customer_key (FK to Dim_Customer)
├── store_key (FK to Dim_Store)
├── quantity_sold
├── unit_price
└── total_amount
Dim_Date: date_key, date, year, quarter, month, day
Dim_Product: product_key, product_name, category, brand, supplier
Dim_Customer: customer_key, name, address, city, state, segment
Dim_Store: store_key, store_name, region, manager
Query Example:
SELECT
d.year,
p.category,
SUM(f.total_amount) AS sales
FROM Fact_Sales f
JOIN Dim_Date d ON f.date_key = d.date_key
JOIN Dim_Product p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, p.category
ORDER BY sales DESC;
Snowflake schema normalizes dimension tables.
Snowflake Example:
Dim_Product
├── product_key
├── product_name
├── category_key (FK to Dim_Category)
└── supplier_key (FK to Dim_Supplier)
Dim_Category: category_key, category_name, department
Dim_Supplier: supplier_key, supplier_name, city, country
Advantages:
- Less redundancy
- Easier maintenance
- Space savings
Disadvantages:
- More joins
- Slower queries
- More complex
ETL (Extract, Transform, Load) moves data from sources to warehouse.
Extract:
- Read from source systems
- Handle different formats
- Incremental or full load
- Change data capture
Transform:
- Clean data (nulls, defaults)
- Validate business rules
- Aggregate/calculate
- Conform dimensions
- Handle slowly changing dimensions
Load:
- Insert into warehouse
- Update existing records
- Build indexes
- Update aggregates
- Verify counts
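The slowly-changing-dimension step mentioned above is worth a sketch for the common Type 2 pattern: close the current row and insert a new version when a tracked attribute changes. The `apply_scd_type2` helper and column names below are hypothetical, in-memory illustrations.

```python
from datetime import date

def apply_scd_type2(dim_rows, incoming, key, tracked):
    """Type 2 SCD: expire the current row and append a new version
    when a tracked attribute changes (in-memory illustration)."""
    today = date.today().isoformat()
    current = next((r for r in dim_rows
                    if r[key] == incoming[key] and r["is_current"]), None)
    if current is None:
        # First time we see this business key
        dim_rows.append({**incoming, "valid_from": today,
                         "valid_to": None, "is_current": True})
    elif any(current[c] != incoming[c] for c in tracked):
        # Tracked attribute changed: close the old row, add a new version
        current["valid_to"] = today
        current["is_current"] = False
        dim_rows.append({**incoming, "valid_from": today,
                         "valid_to": None, "is_current": True})
```

Unchanged incoming rows produce no new version, so the dimension keeps exactly one current row per business key plus its full history.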
ETL Example (Python with pandas):
import pandas as pd
from sqlalchemy import create_engine
# Extract
df = pd.read_csv('sales_data.csv')
# Transform
df['order_date'] = pd.to_datetime(df['order_date'])
df['total'] = df['quantity'] * df['price']
df = df.drop_duplicates()
df = df[df['total'] > 0]
# Load
engine = create_engine('postgresql://user:pass@localhost/warehouse')
df.to_sql('staging_sales', engine, if_exists='replace')
# Run SQL transformations
with engine.connect() as conn:
conn.execute("""
INSERT INTO fact_sales
SELECT s.*, d.date_key, p.product_key
FROM staging_sales s
JOIN dim_date d ON s.order_date = d.date
JOIN dim_product p ON s.product_id = p.product_id
""")
Data marts are subject-specific data warehouse subsets.
Types:
Dependent Data Mart:
- Derived from enterprise warehouse
- Consistent data
- Single source of truth
Independent Data Mart:
- Built directly from sources
- Faster to build
- Potential inconsistency
Data Mart Examples:
Sales Mart:
- Fact_Sales
- Dim_Product
- Dim_Customer
- Dim_Date
- Dim_Store
Inventory Mart:
- Fact_Inventory
- Dim_Product
- Dim_Warehouse
- Dim_Date
OLAP operations enable interactive data analysis.
Roll-up (Drill-up): Increase aggregation level.
Daily → Monthly → Quarterly → Yearly
City → State → Region → Country
Product → Category → Department
Drill-down: Decrease aggregation level (more detail).
Yearly → Quarterly → Monthly → Daily
Slice: Select one dimension value.
Sales for 2024 only
Dice: Select multiple dimension values.
Sales for 2024, Product Category 'Electronics', Region 'West'
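The operations above can be mimicked on a toy in-memory fact table. A pure-Python sketch (the row layout and the `roll_up` helper are illustrative, not a standard API):

```python
from collections import defaultdict

# Toy fact rows: (year, quarter, region, amount)
sales = [
    (2024, "Q1", "West", 100),
    (2024, "Q1", "East", 200),
    (2024, "Q2", "West", 150),
    (2023, "Q4", "West", 80),
]

def roll_up(rows, dim_index):
    """Aggregate the amount measure up to a single dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim_index]] += row[3]
    return dict(totals)

by_year = roll_up(sales, 0)                      # roll-up: quarters -> years
slice_2024 = [r for r in sales if r[0] == 2024]  # slice: fix one dimension
dice = [r for r in sales if r[0] == 2024 and r[2] == "West"]  # dice: fix several
```

Drill-down is the inverse of `roll_up`: group by the finer key `(year, quarter)` instead of `year` alone.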
Pivot (Cross-tab): Reorient view (rows to columns).
-- Pivot example
SELECT *
FROM (
SELECT year, quarter, total
FROM sales
) src
PIVOT (
SUM(total)
FOR quarter IN ('Q1' AS Q1, 'Q2' AS Q2, 'Q3' AS Q3, 'Q4' AS Q4)
) pvt;
Drill-across: Query multiple fact tables.
SELECT d.year,
s.total_sales,
i.total_inventory
FROM dim_date d
LEFT JOIN sales_fact s ON d.date_key = s.date_key
LEFT JOIN inventory_fact i ON d.date_key = i.date_key;
The 5 V's:
Volume: Massive data scale (terabytes to petabytes)
Velocity: High-speed data generation (real-time streams)
Variety: Diverse data types (structured, semi-structured, unstructured)
Veracity: Data quality and trustworthiness
Value: Business value from analysis
Hadoop Components:
HDFS (Hadoop Distributed File System):
- Master/NameNode: Manages metadata
- DataNodes: Store actual data
- Block replication (default 3x)
- Rack awareness
YARN (Yet Another Resource Negotiator):
- ResourceManager: Cluster resource management
- NodeManager: Per-node resource management
- ApplicationMaster: Per-application coordination
HDFS Write Pipeline:
Client → NameNode (get block locations)
Client → DataNode1 (write block)
DataNode1 → DataNode2 (replicate)
DataNode2 → DataNode3 (replicate)
MapReduce processes data in parallel across cluster.
Map Phase:
- Input split into chunks
- Map function processes each record
- Output intermediate key-value pairs
Shuffle Phase:
- Sort and group by key
- Transfer data to reducers
Reduce Phase:
- Process grouped values
- Output final results
Word Count Example:
// Mapper
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
// Reducer
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
Spark provides in-memory cluster computing.
RDD (Resilient Distributed Dataset):
- Immutable distributed collection
- Partitioned across cluster
- Can be rebuilt if lost
- Two operation types: transformations (lazy, build a new RDD) and actions (trigger execution)
Spark SQL:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Example") \
.getOrCreate()
# Read data
df = spark.read.json("sales.json")
# Transformations
result = df.filter(df.amount > 100) \
.groupBy("product") \
.agg({"amount": "sum"}) \
.orderBy("sum(amount)", ascending=False)
# Actions
result.show()
result.write.parquet("output/")
Spark Streaming:
streaming_df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "sales-topic") \
.load()
# Process in micro-batches
counts = streaming_df \
.groupBy(window("timestamp", "1 hour"), "product") \
.count()
query = counts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
Data lakes store raw data in native format.
Characteristics:
- Store all data (structured/unstructured)
- Schema-on-read
- Centralized repository
- Supports various processing engines
Data Lake Architecture:
Data Sources → Ingestion Layer → Storage Layer → Processing → Consumption
(IoT, logs, DBs) (Kafka, Flume) (HDFS/S3) (Spark, Hive) (BI, ML)
Data Lake vs Data Warehouse:
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data | Raw, any format | Processed, structured |
| Schema | On read | On write |
| Users | Data scientists | Business analysts |
| Purpose | Exploration | Reporting |
| Storage | Cheap object storage | Expensive DBMS |
| Quality | Raw, may be messy | Clean, trusted |
External Threats:
- SQL injection
- Network eavesdropping
- Unauthorized access
- Denial of service
Internal Threats:
- Privilege misuse
- Data theft by employees
- Accidental data exposure
- Weak passwords
Environmental Threats:
- Natural disasters
- Power failures
- Hardware failures
- Backup compromise
Discretionary Access Control (DAC):
- Object owner controls access
- GRANT/REVOKE privileges
- SQL standard
Mandatory Access Control (MAC):
- System-wide policies
- Labels and classifications
- Military/Government use
Role-Based Access Control (RBAC):
- Permissions assigned to roles
- Users assigned to roles
- Most common in enterprises
RBAC Example:
-- Create roles
CREATE ROLE sales_read;
CREATE ROLE sales_write;
CREATE ROLE manager;
-- Grant permissions to roles
GRANT SELECT ON customers TO sales_read;
GRANT SELECT, INSERT, UPDATE ON orders TO sales_write;
GRANT ALL PRIVILEGES ON sales.* TO manager;
-- Assign users to roles
GRANT sales_read TO user_alice;
GRANT sales_write TO user_bob;
GRANT manager TO user_carol;
SQL injection inserts malicious SQL through application input.
Vulnerable Code:
# Dangerous - don't do this
user_input = request.GET['username']
query = f"SELECT * FROM users WHERE username = '{user_input}'"
cursor.execute(query)
# Attacker input: ' OR '1'='1
# Result: SELECT * FROM users WHERE username = '' OR '1'='1'
# Returns all users!
Prevention:
# Safe - parameterized query
query = "SELECT * FROM users WHERE username = %s"
cursor.execute(query, (user_input,))
Additional Defenses:
- Input validation
- Stored procedures
- Least privilege accounts
- Web application firewalls
Transparent Data Encryption protects data at rest.
TDE Architecture:
Master Key → protects → Database Encryption Key → protects → Data
TDE Implementation (SQL Server):
-- Create master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongPassword123!';
-- Create certificate
CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE Certificate';
-- Create database encryption key
USE database;
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TDECert;
-- Enable encryption
ALTER DATABASE database SET ENCRYPTION ON;
Column-Level Encryption:
-- PostgreSQL with pgcrypto
CREATE EXTENSION pgcrypto;
INSERT INTO users (username, ssn)
VALUES ('alice', pgp_sym_encrypt('123-45-6789', 'encryption-key'));
SELECT username,
pgp_sym_decrypt(ssn::bytea, 'encryption-key') AS ssn
FROM users;
Auditing tracks database activities for security and compliance.
Audit Configuration:
-- SQL Server Audit
CREATE SERVER AUDIT SecurityAudit
TO FILE (FILEPATH = 'C:\Audit\');
CREATE DATABASE AUDIT SPECIFICATION UserActivity
FOR SERVER AUDIT SecurityAudit
ADD (SELECT, INSERT, UPDATE, DELETE ON SCHEMA::dbo BY public);
ALTER DATABASE AUDIT SPECIFICATION UserActivity WITH (STATE = ON);
-- Oracle Fine-Grained Auditing
BEGIN
DBMS_FGA.ADD_POLICY(
object_schema => 'HR',
object_name => 'EMPLOYEES',
policy_name => 'audit_salary',
audit_condition => 'salary > 100000',
audit_column => 'SALARY',
handler_schema => 'SECURITY',
handler_module => 'log_salary_access',
enable => TRUE
);
END;
Audit Log Analysis:
-- Query audit logs
SELECT
event_time,
session_id,
database_principal_name,
statement
FROM sys.fn_get_audit_file('C:\Audit\*.sqlaudit', DEFAULT, DEFAULT)
WHERE event_time > '2024-01-01'
ORDER BY event_time DESC;
GDPR (General Data Protection Regulation) protects EU citizens' data.
Key Principles:
- Lawful processing
- Purpose limitation
- Data minimization
- Accuracy
- Storage limitation
- Integrity and confidentiality
- Accountability
Data Subject Rights:
- Right to access
- Right to rectification
- Right to erasure (right to be forgotten)
- Right to restrict processing
- Right to data portability
- Right to object
Technical Implementation:
-- Right to be forgotten
CREATE PROCEDURE forget_user(IN user_id_param INT)
BEGIN
START TRANSACTION;
-- Anonymize personal data
UPDATE users
SET email = CONCAT('deleted-', user_id, '@anonymous.com'),
name = 'Deleted User',
phone = NULL
WHERE user_id = user_id_param;
-- Delete sensitive data from other tables
DELETE FROM user_activity_log WHERE user_id = user_id_param;
DELETE FROM marketing_preferences WHERE user_id = user_id_param;
-- Log the deletion for audit
INSERT INTO deletion_log (user_id, deletion_date, reason)
VALUES (user_id_param, NOW(), 'GDPR request');
COMMIT;
END;
HIPAA (Health Insurance Portability and Accountability Act) protects health information.
Protected Health Information (PHI):
- Names
- Dates (except year)
- Phone numbers
- Email addresses
- SSN
- Medical record numbers
- Health plan numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers
- Device identifiers
- URLs/IP addresses
- Biometric identifiers
- Photos
- Any unique identifying number/code
HIPAA Technical Safeguards:
-- Access control
CREATE ROLE healthcare_provider;
CREATE ROLE billing_staff;
GRANT SELECT ON patients_demographics TO healthcare_provider;
GRANT SELECT ON patients_insurance TO billing_staff;
-- Audit logs
CREATE TABLE access_audit (
access_id INT PRIMARY KEY AUTO_INCREMENT,
user_id VARCHAR(100) NOT NULL,
patient_id INT NOT NULL,
access_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
action VARCHAR(50) NOT NULL,
reason_for_access VARCHAR(500),
ip_address VARCHAR(45)
);
-- Trigger for audit (note: SQL triggers fire only on INSERT/UPDATE/DELETE;
-- SELECT access must be logged by the application or a server audit feature)
CREATE TRIGGER audit_patient_update
AFTER UPDATE ON patients_demographics
FOR EACH ROW
BEGIN
INSERT INTO access_audit (user_id, patient_id, action, ip_address)
VALUES (SESSION_USER(), NEW.patient_id, 'UPDATE', SUBSTRING_INDEX(USER(), '@', -1));
END;
Data masking hides sensitive data from unauthorized users.
Static Data Masking: Permanent replacement of sensitive data in non-production environments.
Dynamic Data Masking: Real-time masking based on user permissions.
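The masking idea itself is simple to sketch in plain Python. The `mask_partial` and `mask_email` helpers below are hypothetical and only loosely mirror typical DBMS masking functions:

```python
def mask_partial(value, prefix=2, suffix=0, pad="X"):
    """Keep `prefix` leading and `suffix` trailing characters, mask the rest."""
    if len(value) <= prefix + suffix:
        return pad * len(value)          # too short: mask everything
    tail = value[len(value) - suffix:] if suffix else ""
    return value[:prefix] + pad * (len(value) - prefix - suffix) + tail

def mask_email(value):
    """aXXX@XXXX.com-style mask: keep the first character and the TLD."""
    local, _, domain = value.partition("@")
    return local[:1] + "XXX@XXXX." + domain.rsplit(".", 1)[-1]
```

In dynamic masking the DBMS applies functions like these at query time based on the caller's role, so the stored value is never altered.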
Dynamic Masking Examples:
-- SQL Server Dynamic Data Masking
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100) MASKED WITH (FUNCTION = 'partial(2, "XXXX", 0)'),
email VARCHAR(100) MASKED WITH (FUNCTION = 'email()'),
phone VARCHAR(20) MASKED WITH (FUNCTION = 'partial(0, "XXX-XXX-", 4)'),
ssn CHAR(11) MASKED WITH (FUNCTION = 'default()')
);
-- Oracle Data Redaction
BEGIN
DBMS_REDACT.ADD_POLICY(
object_schema => 'HR',
object_name => 'EMPLOYEES',
policy_name => 'redact_ssn',
column_name => 'SSN',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => 'VVVFVVFVVVV,VVV-XX-XXXX,*,1,5',
expression => '1=1'
);
END;
Security by Design Principles:
- Least Privilege: Grant minimum necessary permissions
- Defense in Depth: Multiple security layers
- Fail Secure: Default to denied
- Separation of Duties: No single user has all power
- Secure Defaults: Start secure, open only when needed
Secure Schema Design:
-- Separate sensitive data
CREATE TABLE public_user_info (
user_id INT PRIMARY KEY,
username VARCHAR(100) NOT NULL,
display_name VARCHAR(200)
);
CREATE TABLE private_user_data (
user_id INT PRIMARY KEY REFERENCES public_user_info(user_id),
email_hash CHAR(64) NOT NULL, -- Hash for lookups
email_encrypted TEXT NOT NULL, -- Encrypted for display
ssn_encrypted TEXT,
-- Additional sensitive data
);
-- Views for controlled access
CREATE VIEW user_contact AS
SELECT
u.user_id,
u.display_name,
CASE
WHEN CURRENT_ROLE() IN ('admin', 'support')
THEN pgp_sym_decrypt(p.email_encrypted, 'key')
ELSE '***REDACTED***'
END AS email
FROM public_user_info u
LEFT JOIN private_user_data p ON u.user_id = p.user_id;
Security Checklist:
- Use strong authentication (MFA where possible)
- Encrypt sensitive data at rest
- Use TLS for data in transit
- Regular security patches
- Audit logging enabled
- Least privilege access
- Secure backup strategy
- Regular security testing
- Incident response plan
- Compliance monitoring
Redis is an in-memory data structure store.
Data Structures:
# Strings
SET user:1001:name "Alice"
GET user:1001:name
# Lists (queues/stacks)
LPUSH tasks "job1"
RPUSH tasks "job2"
LPOP tasks
# Sets
SADD friends:1001 1002 1003 1004
SMEMBERS friends:1001
SISMEMBER friends:1001 1002
# Hashes
HSET user:1001 name "Alice" age 30 email "alice@example.com"
HGETALL user:1001
# Sorted Sets
ZADD leaderboard 100 "player1"
ZADD leaderboard 85 "player2"
ZRANGE leaderboard 0 10 WITHSCORES
# Geospatial
GEOADD locations 13.361389 38.115556 "Palermo"
GEODIST locations "Palermo" "Catania" km
Persistence Options:
- RDB (snapshots): Point-in-time dumps
- AOF (Append Only File): Log every write
- No persistence: Pure cache
Use Cases:
- Session store
- Real-time leaderboards
- Message queues
- Rate limiting
- Caching layer
SAP HANA combines row and column stores in memory.
Features:
- Columnar storage for analytics
- Row store for OLTP
- Advanced compression
- Multi-core processing
- SQL and Graph support
Column Store Example:
-- Create column table
CREATE COLUMN TABLE sales (
id INTEGER PRIMARY KEY,
product_id INTEGER,
customer_id INTEGER,
amount DECIMAL(10,2),
sale_date DATE
);
-- In-memory processing
SELECT
product_id,
SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_id;
Characteristics:
- Decentralized
- Immutable
- Append-only
- Cryptographically verified
- Consensus-based
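The immutable, append-only property comes from hash chaining: each block stores the hash of its predecessor, so tampering with any block invalidates everything after it. A minimal sketch (block layout and helper names are illustrative):

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash over a canonical JSON encoding of the block contents
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, data) -> None:
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "data": data, "prev_hash": prev}
    block["hash"] = block_hash(block)
    chain.append(block)

def verify_chain(chain: list) -> bool:
    # Tampering with any block breaks its own hash and every link after it
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        if block["hash"] != block_hash(body):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True
```

Real blockchains add signatures, Merkle trees, and consensus on top, but the tamper-evidence comes from exactly this chaining.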
Consensus Mechanisms:
Proof of Work (PoW):
- Solve computational puzzle
- Energy intensive
- Used by Bitcoin
Proof of Stake (PoS):
- Validators stake cryptocurrency
- Energy efficient
- Used by Ethereum 2.0
Practical Byzantine Fault Tolerance (PBFT):
- Voting-based consensus
- Permissioned blockchains
- High performance
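Proof of Work can be sketched as a brute-force nonce search: find a nonce that makes the block hash start with a required number of zeros. Illustrative only — real systems hash a structured block header and adjust difficulty dynamically.

```python
import hashlib

def mine(block_data: str, difficulty: int = 4) -> int:
    """Find a nonce so the block's hash starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1        # no shortcut exists: the puzzle must be brute-forced
```

Each extra zero digit multiplies the expected work by 16, which is why verifying a block is cheap but producing one is expensive.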
Self-executing contracts with terms directly written in code.
Ethereum Smart Contract (Solidity):
pragma solidity ^0.8.0;
contract SimpleBank {
mapping(address => uint) private balances;
event Deposit(address indexed account, uint amount);
event Withdrawal(address indexed account, uint amount);
function deposit() public payable {
balances[msg.sender] += msg.value;
emit Deposit(msg.sender, msg.value);
}
function withdraw(uint amount) public {
require(balances[msg.sender] >= amount, "Insufficient balance");
balances[msg.sender] -= amount;
payable(msg.sender).transfer(amount);
emit Withdrawal(msg.sender, amount);
}
function getBalance() public view returns (uint) {
return balances[msg.sender];
}
}
| Aspect | Traditional DB | Blockchain |
|---|---|---|
| Control | Centralized | Decentralized |
| Write | Update/Delete | Append-only |
| Trust | Trust administrator | Trust math/cryptography |
| Performance | High (thousands TPS) | Low (tens TPS) |
| Cost | Operational | Transaction fees |
| Use Case | General purpose | Trustless transactions |
Vector databases store and query embeddings for AI applications.
Characteristics:
- Store high-dimensional vectors
- Efficient similarity search
- Indexing for nearest neighbors
- Integration with ML models
Popular Vector Databases:
- Pinecone
- Weaviate
- Milvus
- Qdrant
Example (Pinecone):
import pinecone
# Initialize
pinecone.init(api_key="your-api-key")
index = pinecone.Index("embeddings")
# Insert vectors
vectors = [
("id1", [0.1, 0.2, 0.3, ...], {"metadata": "text1"}),
("id2", [0.4, 0.5, 0.6, ...], {"metadata": "text2"})
]
index.upsert(vectors)
# Query similar vectors
results = index.query(
vector=[0.15, 0.25, 0.35, ...],
top_k=5,
include_metadata=True
)
Store embeddings alongside traditional data.
PostgreSQL with pgvector:
-- Enable extension
CREATE EXTENSION vector;
-- Create table with embeddings
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536) -- OpenAI embedding dimension
);
-- Create index for similarity search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Insert with embedding
INSERT INTO documents (content, embedding)
VALUES ('Sample text', '[0.1,0.2,0.3,...]');
-- Similarity search
SELECT content, 1 - (embedding <=> '[0.15,0.25,0.35,...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.15,0.25,0.35,...]'
LIMIT 10;
Distance Metrics:
- Cosine similarity: Angle between vectors
- Euclidean distance: Straight-line distance
- Dot product: Projection similarity
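These three metrics are a few lines each in plain Python (helper names are mine; production systems use vectorized libraries):

```python
import math

def dot(a, b):
    # Projection similarity: large when vectors point the same way and are long
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based: 1.0 same direction, 0.0 orthogonal, -1.0 opposite
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

Note that pgvector's `<=>` operator returns cosine *distance*, i.e. 1 minus the cosine similarity computed here.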
Approximate Nearest Neighbor (ANN) Algorithms:
- HNSW (Hierarchical Navigable Small World): Graph-based
- IVF (Inverted File Index): Cluster-based
- PQ (Product Quantization): Compression-based
Machine learning improves query optimization.
Learned Indexes: Replace traditional indexes with neural networks that learn data distribution.
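The idea can be sketched with a linear model instead of a neural network: fit a line mapping keys to positions in a sorted array, then correct the prediction with a bounded local search. Everything here (function names, the error bound) is illustrative, not a real learned-index implementation.

```python
import bisect

def fit_line(keys):
    """Least-squares line predicting position from key (keys must be sorted)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_x = (n - 1) / 2
    cov = sum((k - mean_k) * (x - mean_x) for x, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var if var else 0.0
    return slope, mean_x - slope * mean_k

def lookup(keys, key, model, err=16):
    """Predict the position, then binary-search a small window around it."""
    slope, intercept = model
    guess = int(slope * key + intercept)
    lo, hi = max(0, guess - err), min(len(keys), guess + err)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < len(keys) and keys[i] == key:
        return i
    # Model was off by more than `err`: fall back to a full binary search
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1
```

When the key distribution is close to the model, lookups touch only a constant-size window instead of a full B-tree path.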
Query Prediction: Predict query patterns to pre-optimize.
Cardinality Estimation: ML models estimate result sizes more accurately.
Example - Learned Cardinality:
-- Traditional: based on statistics
Estimated: 10,000 rows
Actual: 15,000 rows
-- ML model: learns from query history
Estimated: 14,800 rows
Actual: 15,000 rows -- Much closer!
Identifying Slow Queries:
-- PostgreSQL slow query log
log_min_duration_statement = 1000 -- Log queries > 1 second
-- MySQL slow query log
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
-- Find current slow queries
SELECT * FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '1 minute';
-- Analyze query performance
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large_table WHERE condition;
Index Maintenance:
-- Find unused indexes
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
-- Find duplicate indexes
SELECT pg_size_pretty(SUM(pg_relation_size(idx))::BIGINT) AS total,
array_agg(idx) AS indexes
FROM (
SELECT indexrelid::regclass AS idx,
indrelid,
array_agg(attname) AS cols
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indrelid
AND a.attnum = ANY(i.indkey)
WHERE i.indisunique = false
GROUP BY i.indexrelid, i.indrelid, i.indkey
) t
GROUP BY indrelid, cols
HAVING COUNT(*) > 1;
-- Rebuild fragmented indexes
REINDEX INDEX idx_name;
Database Caching:
-- PostgreSQL shared buffers
SHOW shared_buffers; -- Typically 25% of RAM
-- MySQL InnoDB buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Query cache (deprecated in modern versions)
-- Use application-level caching instead
Application-Level Caching (Redis):
import redis
import json
cache = redis.Redis(host='localhost', port=6379, db=0)
def get_user(user_id):
# Try cache first
cached = cache.get(f"user:{user_id}")
if cached:
return json.loads(cached)
# Cache miss - query database
user = db.query("SELECT * FROM users WHERE id = %s", user_id)
# Store in cache (expires in 1 hour)
cache.setex(f"user:{user_id}", 3600, json.dumps(user))
return user
Vertical Scaling (Scale Up):
- More CPU, RAM, faster storage
- Simpler, no application changes
- Hardware limits
Horizontal Scaling (Scale Out):
Read Replicas:
-- Master handles writes
-- Replicas handle reads
SELECT * FROM orders; -- Goes to replica
INSERT INTO orders ...; -- Goes to master
Sharding:
Application-level sharding:
def get_shard(user_id):
shard_id = user_id % 4
return shard_connections[shard_id]
Database-level sharding (CockroachDB):
-- Automatic sharding
SHOW RANGES FROM TABLE users;
Connection Pooling:
# SQLAlchemy connection pool
engine = create_engine(
'postgresql://user:pass@localhost/db',
pool_size=20,
max_overflow=10,
pool_pre_ping=True
)
Principles:
- Version control for schema
- Automated migrations
- Testing in CI pipeline
- Rollback capability
- Zero-downtime deployments
Migration Tools:
- Flyway
- Liquibase
- Alembic (Python)
- ActiveRecord Migrations (Rails)
Flyway Example:
-- V1__create_users_table.sql
CREATE TABLE users (
id INT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(200) NOT NULL UNIQUE,
name VARCHAR(200) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- V2__add_password_hash.sql
ALTER TABLE users ADD COLUMN password_hash CHAR(60) NOT NULL;
-- V3__create_orders_table.sql
CREATE TABLE orders (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL,
amount DECIMAL(10,2) NOT NULL,
status VARCHAR(50) DEFAULT 'pending',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(id)
);
Migration Commands:
# Apply migrations
flyway migrate
# Check status
flyway info
# Validate checksums
flyway validate
# Repair failed migration
flyway repair
Automated Backup Script:
#!/bin/bash
# daily_backup.sh
BACKUP_DIR="/backup/$(date +%Y-%m-%d)"
mkdir -p $BACKUP_DIR
# Full database backup (custom format so pg_restore can verify it)
pg_dump -h localhost -U dbadmin -Fc mydb > $BACKUP_DIR/full.dump
# Archive and compress
tar -czf $BACKUP_DIR.tar.gz $BACKUP_DIR
# Upload to cloud storage
aws s3 cp $BACKUP_DIR.tar.gz s3://my-backups/
# Clean old backups (keep 30 days)
find /backup -type d -mtime +30 -exec rm -rf {} \;
# Test backup integrity
pg_restore --list $BACKUP_DIR/full.dump > /dev/null
if [ $? -eq 0 ]; then
echo "Backup valid" | mail -s "Backup Success" dba@example.com
else
echo "Backup corrupted" | mail -s "BACKUP FAILED" dba@example.com
fi
Prometheus + Grafana Setup:
# prometheus.yml
scrape_configs:
- job_name: 'postgresql'
static_configs:
- targets: ['localhost:9187'] # PostgreSQL exporter
- job_name: 'mysql'
static_configs:
- targets: ['localhost:9104'] # MySQL exporter
Key Metrics:
- Query latency (p95, p99)
- Connection count
- Cache hit ratio
- Replication lag
- Disk I/O
- CPU/Memory usage
Alerting Rules:
groups:
- name: database_alerts
rules:
- alert: HighQueryLatency
expr: pg_stat_activity_max_tx_duration > 300
for: 5m
annotations:
summary: "High query latency detected"
- alert: LowCacheHitRatio
expr: (pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)) < 0.95
for: 10m
annotations:
summary: "Cache hit ratio below 95%"
Requirements:
- 100% ACID compliance
- Zero data loss
- High availability (99.999%)
- Regulatory compliance
- Fraud detection
- Real-time balances
Architecture:
Frontend → Load Balancer → Application Servers → Database Cluster
↓ ↓
Cache Layer Read Replicas
(Redis) (Reporting)
Database Design:
-- Account table with optimistic locking
CREATE TABLE accounts (
account_id BIGINT PRIMARY KEY,
account_number VARCHAR(20) UNIQUE NOT NULL,
customer_id BIGINT NOT NULL,
account_type VARCHAR(20) NOT NULL,
balance DECIMAL(15,2) NOT NULL DEFAULT 0.00,
currency CHAR(3) NOT NULL DEFAULT 'USD',
status VARCHAR(20) NOT NULL DEFAULT 'active',
version INT NOT NULL DEFAULT 0, -- For optimistic locking
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Transaction table (append-only)
-- MySQL notes: the partitioning column must appear in every unique key
-- (hence the composite keys below), and YEAR() partitioning requires a
-- DATE/DATETIME column, so created_at is DATETIME rather than TIMESTAMP
CREATE TABLE transactions (
transaction_id BIGINT NOT NULL AUTO_INCREMENT,
account_id BIGINT NOT NULL REFERENCES accounts(account_id),
transaction_type VARCHAR(20) NOT NULL,
amount DECIMAL(15,2) NOT NULL,
balance_after DECIMAL(15,2) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
reference_number VARCHAR(50),
description VARCHAR(500),
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (transaction_id, created_at),
UNIQUE KEY uq_reference (reference_number, created_at),
INDEX idx_account_created (account_id, created_at DESC)
) PARTITION BY RANGE (YEAR(created_at)) (
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
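One benefit of the append-only design with a `balance_after` column is that an account's history can be audited by replay: each row's `balance_after` must equal the previous one plus the row's `amount`. A sketch of such a consistency check (the function and the sample rows are illustrative, not part of the schema):

```python
def audit_ledger(rows, opening_balance=0):
    """rows: (amount, balance_after) pairs in chronological order.
    True iff every balance_after matches the running sum."""
    balance = opening_balance
    for amount, balance_after in rows:
        balance += amount
        if balance != balance_after:
            return False
    return True

# Consistent history: +100 deposit, then -30 transfer out
ok = audit_ledger([(100, 100), (-30, 70)])
# A tampered balance_after is detected
bad = audit_ledger([(100, 100), (-30, 75)])
```

Runs like this are one way regulators and internal fraud-detection jobs exploit an immutable transaction log.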
-- Transfer funds stored procedure with locking
DELIMITER //
CREATE PROCEDURE TransferFunds(
IN from_account BIGINT,
IN to_account BIGINT,
IN transfer_amount DECIMAL(15,2),
IN reference VARCHAR(50)
)
BEGIN
DECLARE EXIT HANDLER FOR SQLEXCEPTION
BEGIN
ROLLBACK;
RESIGNAL;
END;
START TRANSACTION;
-- Lock accounts in consistent order to prevent deadlocks
IF from_account < to_account THEN
SELECT balance INTO @balance FROM accounts
WHERE account_id = from_account FOR UPDATE;
SELECT balance INTO @balance FROM accounts
WHERE account_id = to_account FOR UPDATE;
ELSE
SELECT balance INTO @balance FROM accounts
WHERE account_id = to_account FOR UPDATE;
SELECT balance INTO @balance FROM accounts
WHERE account_id = from_account FOR UPDATE;
END IF;
-- Check sufficient funds
SELECT balance INTO @from_balance FROM accounts
WHERE account_id = from_account;
IF @from_balance < transfer_amount THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Insufficient funds';
END IF;
-- Update balances
UPDATE accounts
SET balance = balance - transfer_amount,
version = version + 1
WHERE account_id = from_account;
UPDATE accounts
SET balance = balance + transfer_amount,
version = version + 1
WHERE account_id = to_account;
-- Record transactions
-- (reference_number is unique, so each leg gets its own suffix)
INSERT INTO transactions (
account_id, transaction_type, amount,
balance_after, reference_number, description
) VALUES (
from_account, 'TRANSFER_OUT', -transfer_amount,
(SELECT balance FROM accounts WHERE account_id = from_account),
CONCAT(reference, '-OUT'), CONCAT('Transfer to ', to_account)
);
INSERT INTO transactions (
account_id, transaction_type, amount,
balance_after, reference_number, description
) VALUES (
to_account, 'TRANSFER_IN', transfer_amount,
(SELECT balance FROM accounts WHERE account_id = to_account),
CONCAT(reference, '-IN'), CONCAT('Transfer from ', from_account)
);
COMMIT;
END//
DELIMITER ;

High Availability Setup:
-- Synchronous replication
-- Primary and standby with automatic failover
-- Primary configuration (postgresql.conf)
synchronous_commit = on
synchronous_standby_names = 'standby1'
-- Standby configuration (recovery.conf; from PostgreSQL 12 onward these
-- settings live in postgresql.conf alongside an empty standby.signal file)
primary_conninfo = 'host=primary port=5432 user=replicator'
recovery_target_timeline = 'latest'

Requirements:
- 1B+ users
- 100M+ daily active users
- Real-time feeds
- Graph relationships
- High write throughput
- Eventually consistent reads
Database Mix:
- User data: PostgreSQL (sharded)
- Posts: Cassandra (wide rows)
- Relationships: Neo4j (graph)
- Feeds: Redis (timelines)
- Activity logs: HBase (time-series)
Post Storage (Cassandra):
CREATE TABLE posts (
user_id UUID,
post_id TIMEUUID,
content TEXT,
media_urls LIST<TEXT>,
likes_count INT,
comments_count INT,
shares_count INT,
created_at TIMESTAMP,
PRIMARY KEY (user_id, created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);
-- Get user's posts
SELECT * FROM posts
WHERE user_id = ?
LIMIT 20;
-- Time-based partitioning
CREATE TABLE posts_by_time (
bucket INT, -- e.g., YYYYMMDD
post_id TIMEUUID,
user_id UUID,
content TEXT,
PRIMARY KEY (bucket, post_id)
);

Feed Generation:
def generate_feed(user_id, limit=50):
    # Get followed users from graph DB
    followed = graph.run("""
        MATCH (u:User {id: $user_id})-[:FOLLOWS]->(f:User)
        RETURN f.id
    """, user_id=user_id).values()
    # Get recent posts from Cassandra
    posts = []
    for followed_id in followed:
        recent = session.execute("""
            SELECT * FROM posts
            WHERE user_id = %s
            ORDER BY created_at DESC
            LIMIT 10
        """, [followed_id])
        posts.extend(recent)
    # Sort by time and cache in Redis
    posts.sort(key=lambda x: x.created_at, reverse=True)
    redis_client.setex(
        f"feed:{user_id}",
        300,  # 5 minute cache
        # Driver Row objects are not JSON-serializable; convert to dicts first
        json.dumps([dict(row._asdict()) for row in posts[:limit]], default=str)
    )
    return posts[:limit]

Requirements:
- Product catalog
- Shopping cart
- Order processing
- Inventory management
- Recommendation engine
- Seasonal spikes (Black Friday)
Key Challenges:
- Handle traffic spikes
- Prevent overselling
- Real-time inventory
- Personalized recommendations
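The "prevent overselling" challenge is commonly handled with optimistic locking: read the row along with its version, apply the update only if the version is unchanged, and retry on conflict. A single-threaded Python sketch of the compare-and-swap loop (the in-memory `store` stands in for the inventory table; in a real database the version check and the update must be a single atomic statement):

```python
class VersionConflict(Exception):
    pass

def reserve(store, product_id, qty, max_retries=3):
    """Optimistically reserve qty units of a product; retry on version conflict."""
    for _ in range(max_retries):
        snapshot = dict(store[product_id])  # read the row and remember its version
        if snapshot["quantity"] - snapshot["reserved"] < qty:
            raise ValueError("Insufficient inventory")
        current = store[product_id]
        # Compare-and-swap: apply only if no other writer bumped the version
        # (in SQL this is the UPDATE ... WHERE version = v_version step)
        if current["version"] == snapshot["version"]:
            current["reserved"] += qty
            current["version"] += 1
            return current["version"]
    raise VersionConflict(f"could not reserve product {product_id}")

store = {42: {"quantity": 10, "reserved": 0, "version": 0}}
reserve(store, 42, 3)
```

Bounding the retries matters during traffic spikes: under heavy contention it is better to fail fast and let the shopper retry than to spin indefinitely.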
Inventory Management:
-- Inventory table with optimistic locking
CREATE TABLE inventory (
product_id INT PRIMARY KEY,
sku VARCHAR(50) UNIQUE NOT NULL,
quantity INT NOT NULL CHECK (quantity >= 0),
reserved INT NOT NULL DEFAULT 0,
reorder_point INT,
version INT NOT NULL DEFAULT 0,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Reserve inventory with optimistic locking
CREATE PROCEDURE ReserveInventory(
IN p_product_id INT,
IN p_quantity INT,
IN p_order_id INT
)
BEGIN
DECLARE v_available INT;
DECLARE v_version INT;
DECLARE EXIT HANDLER FOR SQLEXCEPTION
BEGIN
ROLLBACK;
RESIGNAL;
END;
START TRANSACTION;
-- Check availability
SELECT quantity - reserved, version
INTO v_available, v_version
FROM inventory
WHERE product_id = p_product_id
FOR UPDATE;
IF v_available < p_quantity THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Insufficient inventory';
END IF;
-- Update with version check
UPDATE inventory
SET reserved = reserved + p_quantity,
version = version + 1
WHERE product_id = p_product_id
AND version = v_version;
-- Record reservation
INSERT INTO inventory_reservations (
order_id, product_id, quantity, reserved_at
) VALUES (
p_order_id, p_product_id, p_quantity, NOW()
);
COMMIT;
END;

Order Processing with Saga Pattern:
def create_order(user_id, items):
    order_id = None
    payment_id = None
    try:
        # Step 1: Create order record
        order_id = create_order_record(user_id, items)
        # Step 2: Reserve inventory for each item
        for item in items:
            reserve_inventory(item.product_id, item.quantity, order_id)
        # Step 3: Process payment
        payment_id = process_payment(user_id, calculate_total(items))
        # Step 4: Update order status
        confirm_order(order_id, payment_id)
        # Step 5: Trigger fulfillment
        queue_fulfillment(order_id)
        return order_id
    except Exception:
        # Compensation: roll back the steps that completed
        if order_id is not None:
            cancel_order(order_id)
        if payment_id is not None:
            refund_payment(payment_id)
        raise

Requirements:
- Microsecond latency
- 100% data accuracy
- No downtime
- Complex risk checks
- Regulatory reporting
- Audit trails
Architecture:
Market Data → Order Gateway → Risk Engine → Order Manager → Exchange
                   ↑                             ↓
               Client API                  Trade Database
Database Design:
-- In-memory tables for speed (MEMORY tables are emptied on restart;
-- positions can be rebuilt from the persistent trades table)
CREATE TABLE positions (
account_id INT,
symbol VARCHAR(10),
quantity INT,
avg_price DECIMAL(10,4),
last_updated TIMESTAMP,
PRIMARY KEY (account_id, symbol)
) ENGINE=MEMORY;
-- Persistent storage for compliance
-- (trade_time is DATETIME(6) because MySQL cannot use TO_DAYS() on
-- TIMESTAMP columns, and the partition column must be in the primary key)
CREATE TABLE trades (
trade_id BIGINT NOT NULL,
order_id BIGINT NOT NULL,
account_id INT NOT NULL,
symbol VARCHAR(10) NOT NULL,
side ENUM('BUY', 'SELL') NOT NULL,
quantity INT NOT NULL,
price DECIMAL(10,4) NOT NULL,
trade_time DATETIME(6) NOT NULL, -- Microsecond precision
exchange VARCHAR(50) NOT NULL,
compliance_check BOOLEAN DEFAULT FALSE,
PRIMARY KEY (trade_id, trade_time),
INDEX idx_account_time (account_id, trade_time DESC)
) PARTITION BY RANGE (TO_DAYS(trade_time)) (
PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- Risk limits
CREATE TABLE risk_limits (
account_id INT PRIMARY KEY,
max_position_size INT,
max_daily_loss DECIMAL(10,2),
max_order_value DECIMAL(10,2),
allowed_symbols JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Order Validation:
CREATE PROCEDURE ValidateOrder(
IN p_account_id INT,
IN p_symbol VARCHAR(10),
IN p_side VARCHAR(4),
IN p_quantity INT,
IN p_price DECIMAL(10,4)
)
BEGIN
DECLARE v_current_position INT;
DECLARE v_max_position INT;
DECLARE v_daily_loss DECIMAL(10,2);
DECLARE v_allowed JSON; -- allowed_symbols column holds a JSON array
-- Check risk limits (in microseconds)
SELECT max_position_size, max_daily_loss, allowed_symbols
INTO v_max_position, v_daily_loss, v_allowed
FROM risk_limits
WHERE account_id = p_account_id
FOR UPDATE; -- Lock to prevent concurrent overrides
-- Check symbol allowed
IF NOT JSON_CONTAINS(v_allowed, JSON_QUOTE(p_symbol)) THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Symbol not allowed';
END IF;
-- Check position limits
SELECT COALESCE(SUM(IF(side='BUY', quantity, -quantity)), 0)
INTO v_current_position
FROM trades
WHERE account_id = p_account_id
AND symbol = p_symbol
AND trade_time >= DATE_SUB(NOW(), INTERVAL 1 DAY);
IF ABS(v_current_position +
IF(p_side='BUY', p_quantity, -p_quantity)) > v_max_position THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Position limit exceeded';
END IF;
-- Log validation
INSERT INTO order_validation_log (
account_id, symbol, side, quantity, price,
validation_time, validation_result
) VALUES (
p_account_id, p_symbol, p_side, p_quantity, p_price,
NOW(6), 'PASSED'
);
END;

Proof of Equivalence: σ_cond1(σ_cond2(R)) = σ_cond2(σ_cond1(R))
Let t be a tuple in σ_cond1(σ_cond2(R))
⇔ t ∈ R and cond2(t) true and cond1(t) true
⇔ t ∈ R and cond1(t) true and cond2(t) true
⇔ t ∈ σ_cond2(σ_cond1(R))
Therefore, selection operations commute.
Proof of Push Selection through Join: σ_cond(R ⋈ S) = σ_cond(R) ⋈ S, if cond references only R
Let t be a tuple in σ_cond(R ⋈ S)
⇔ ∃ r ∈ R, s ∈ S such that t = r ∪ s and cond(r) true (cond references only attributes of R, so cond(t) = cond(r))
⇔ ∃ r ∈ σ_cond(R), s ∈ S such that t = r ∪ s
⇔ t ∈ σ_cond(R) ⋈ S
Therefore, selection pushes through join when condition references only one relation.
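Both identities can be sanity-checked mechanically on small relations. Here selection and natural join are implemented over lists of dicts (the sample relations and conditions are made up for illustration):

```python
def select(cond, rel):
    # σ_cond(rel): keep tuples satisfying the predicate
    return [t for t in rel if cond(t)]

def join(r, s):
    # Natural join: combine tuples that agree on all shared attribute names
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                out.append({**a, **b})
    return out

R = [{"id": 1, "x": 5}, {"id": 2, "x": 9}]
S = [{"id": 1, "y": "a"}, {"id": 2, "y": "b"}]
c1 = lambda t: t["x"] > 4       # references only R's attributes
c2 = lambda t: t["id"] == 2

# Selections commute
assert select(c1, select(c2, R)) == select(c2, select(c1, R))
# Selection on R-only attributes pushes through the join
assert select(c1, join(R, S)) == join(select(c1, R), S)
```

The second assertion is the optimization a query planner exploits: filtering R before the join shrinks the join's input without changing the result.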
Common SQL Functions:
| Function | Description | Example |
|---|---|---|
| COUNT(*) | Count rows | SELECT COUNT(*) FROM users |
| SUM(col) | Sum values | SELECT SUM(amount) FROM orders |
| AVG(col) | Average | SELECT AVG(price) FROM products |
| MIN(col) | Minimum | SELECT MIN(salary) FROM employees |
| MAX(col) | Maximum | SELECT MAX(age) FROM users |
| CONCAT | String join | SELECT CONCAT(first, ' ', last) |
| SUBSTRING | Extract part | SELECT SUBSTRING(email, 1, 5) |
| COALESCE | First non-null | SELECT COALESCE(phone, email) |
| DATE_ADD | Add to date | SELECT DATE_ADD(date, INTERVAL 1 DAY) |
| DATEDIFF | Date difference | SELECT DATEDIFF(date1, date2) |
Common SQL Patterns:
-- Pagination
SELECT * FROM users
ORDER BY id
LIMIT 10 OFFSET 20;
-- Insert or update
INSERT INTO users (id, name) VALUES (1, 'Alice')
ON DUPLICATE KEY UPDATE name = VALUES(name);
-- Find duplicates
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- Running total
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales;
-- Top N per group
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn <= 3;

Project 1: Library Management System
CREATE DATABASE library;
USE library;
CREATE TABLE books (
book_id INT PRIMARY KEY AUTO_INCREMENT,
isbn VARCHAR(20) UNIQUE NOT NULL,
title VARCHAR(500) NOT NULL,
author VARCHAR(200) NOT NULL,
publisher VARCHAR(200),
publication_year INT,
category VARCHAR(100),
total_copies INT NOT NULL DEFAULT 1,
available_copies INT NOT NULL DEFAULT 1,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE members (
member_id INT PRIMARY KEY AUTO_INCREMENT,
card_number VARCHAR(20) UNIQUE NOT NULL,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(200) UNIQUE NOT NULL,
phone VARCHAR(20),
address TEXT,
membership_date DATE NOT NULL,
membership_expiry DATE NOT NULL,
status VARCHAR(20) DEFAULT 'active'
);
CREATE TABLE loans (
loan_id INT PRIMARY KEY AUTO_INCREMENT,
book_id INT NOT NULL REFERENCES books(book_id),
member_id INT NOT NULL REFERENCES members(member_id),
loan_date DATE NOT NULL,
due_date DATE NOT NULL,
return_date DATE,
status VARCHAR(20) DEFAULT 'borrowed',
fine_amount DECIMAL(5,2) DEFAULT 0.00,
INDEX idx_dates (due_date, status)
);
CREATE TABLE fines (
fine_id INT PRIMARY KEY AUTO_INCREMENT,
loan_id INT NOT NULL REFERENCES loans(loan_id),
amount DECIMAL(5,2) NOT NULL,
paid BOOLEAN DEFAULT FALSE,
paid_date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Stored procedure for borrowing
DELIMITER //
CREATE PROCEDURE BorrowBook(
IN p_book_id INT,
IN p_member_id INT
)
BEGIN
DECLARE v_available INT;
DECLARE v_member_status VARCHAR(20);
START TRANSACTION;
-- Check member status
SELECT status INTO v_member_status
FROM members WHERE member_id = p_member_id;
IF v_member_status != 'active' THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Member not active';
END IF;
-- Check availability
SELECT available_copies INTO v_available
FROM books WHERE book_id = p_book_id
FOR UPDATE;
IF v_available < 1 THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'No copies available';
END IF;
-- Update book count
UPDATE books
SET available_copies = available_copies - 1
WHERE book_id = p_book_id;
-- Create loan
INSERT INTO loans (book_id, member_id, loan_date, due_date)
VALUES (p_book_id, p_member_id, CURDATE(), DATE_ADD(CURDATE(), INTERVAL 14 DAY));
COMMIT;
END//
DELIMITER ;

Basic Level:
- What is the difference between DELETE and TRUNCATE?
- Explain primary key vs unique key.
- What are the ACID properties?
- Write a query to find second highest salary.
- Explain different types of JOINs.
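For the second-highest-salary question, one standard SQL answer is `SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees)`; the same logic sketched in Python over a plain list (handle duplicates and the single-salary edge case):

```python
def second_highest(salaries):
    """Second-highest distinct salary, or None if no second value exists."""
    distinct = sorted(set(salaries), reverse=True)
    return distinct[1] if len(distinct) > 1 else None

second_highest([50, 90, 90, 70])  # duplicates of the top salary don't count
second_highest([90])              # no second value
```

Interviewers often probe exactly these edge cases: duplicated maxima (hence `set`) and tables with fewer than two distinct salaries (hence `None`, the analogue of SQL's NULL result).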
Intermediate Level:
- What is normalization? Explain 1NF, 2NF, 3NF.
- How do indexes work? When should you use them?
- Explain the difference between clustered and non-clustered indexes.
- What is a deadlock? How can you prevent it?
- Write a query to find employees earning more than their managers.
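The employees-vs-managers question is a classic self-join (join `employees` to itself on `manager_id` and compare salaries); the equivalent logic sketched in Python (the `staff` sample data is illustrative):

```python
def earns_more_than_manager(employees):
    """employees: dicts with id, name, salary, manager_id (None for the top)."""
    by_id = {e["id"]: e for e in employees}  # index for the self-join lookup
    return [
        e["name"]
        for e in employees
        if e["manager_id"] is not None
        and e["salary"] > by_id[e["manager_id"]]["salary"]
    ]

staff = [
    {"id": 1, "name": "Ava", "salary": 120, "manager_id": None},
    {"id": 2, "name": "Bo", "salary": 130, "manager_id": 1},
    {"id": 3, "name": "Cy", "salary": 90, "manager_id": 1},
]
earns_more_than_manager(staff)
```

The dictionary lookup plays the role of the join condition `e.manager_id = m.id`; employees with no manager drop out, just as they would in an inner self-join.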
Advanced Level:
- Explain MVCC and how it enables concurrency.
- How would you design a database for a social media platform?
- What's the difference between vertical and horizontal partitioning?
- Explain the CAP theorem and its implications for distributed databases.
- How would you handle a database with billions of rows?
Practical Scenarios:
- Your database is slow. How do you diagnose and fix it?
- How do you handle a schema migration with zero downtime?
- Design a system to handle Black Friday traffic spikes.
- How do you ensure GDPR compliance in your database?
- Your database crashed at 2 AM. What do you do?
Classic Papers:
- Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks"
- Gray, J. et al. (1981). "The Recovery Manager of the System R Database Manager"
- Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System"
- Stonebraker, M. et al. (1976). "The Design and Implementation of INGRES"
Modern Papers:
- DeCandia, G. et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store"
- Chang, F. et al. (2006). "Bigtable: A Distributed Storage System for Structured Data"
- Corbett, J.C. et al. (2012). "Spanner: Google's Globally-Distributed Database"
- Lakshman, A. & Malik, P. (2010). "Cassandra: A Decentralized Structured Storage System"
Books:
- "Database System Concepts" - Silberschatz, Korth, Sudarshan
- "Designing Data-Intensive Applications" - Martin Kleppmann
- "SQL Performance Explained" - Markus Winand
- "Database Internals" - Alex Petrov
- "Readings in Database Systems" (The Red Book) - Peter Bailis et al.
Online Resources:
- Use The Index, Luke! (use-the-index-luke.com)
- High Scalability (highscalability.com)
- PostgreSQL Documentation
- MySQL Performance Blog
- DB-Engines Ranking (db-engines.com)