- 1.1 Evolution of Data Management
- 1.2 File Systems vs Database Systems
- 1.3 Characteristics of Database Systems
- 1.4 Data Models Overview
- 1.5 Database Users and Roles
- 1.6 Database Architecture (ANSI/SPARC)
- 1.7 Three-Level Architecture
- 1.8 Data Independence
- 1.9 Database System Environment
- 1.10 Overview of Modern DBMS Examples
- Oracle Database
- MySQL
- PostgreSQL
- Microsoft SQL Server
- 2.1 Concept of Data Models
- 2.2 Object-Based Logical Models
- 2.3 Record-Based Logical Models
- 2.4 Physical Data Models
- 2.5 Conceptual vs Logical vs Physical Design
- 2.6 Abstraction and Generalization
- 2.7 Entity-Relationship (ER) Model
- 2.8 Extended ER (EER) Model
- 2.9 UML for Database Modeling
- 2.10 Case Study: Designing a University Database
- 3.1 Structure of Relational Databases
- 3.2 Relations, Tuples, Attributes
- 3.3 Domains and Constraints
- 3.4 Keys (Primary, Candidate, Super, Foreign)
- 3.5 Integrity Constraints
- 3.6 Relational Algebra
- 3.7 Relational Calculus (TRC & DRC)
- 3.8 Views and Relational Model
- 3.9 Null Values and Three-Valued Logic
- 4.1 History and Standardization (ANSI SQL)
- 4.2 Data Definition Language (DDL)
- 4.3 Data Manipulation Language (DML)
- 4.4 Data Control Language (DCL)
- 4.5 Basic Queries (SELECT)
- 4.6 WHERE Clause and Conditions
- 4.7 ORDER BY and GROUP BY
- 4.8 HAVING Clause
- 4.9 Built-in Functions
- 4.10 Constraints in SQL
- 5.1 Subqueries (Correlated & Nested)
- 5.2 Joins (Inner, Outer, Self, Cross)
- 5.3 Set Operations (UNION, INTERSECT, EXCEPT)
- 5.4 Window Functions
- 5.5 Common Table Expressions (CTE)
- 5.6 Recursive Queries
- 5.7 Stored Procedures
- 5.8 Triggers
- 5.9 Index Creation and Usage
- 5.10 Query Performance Analysis
- 6.1 PL/SQL (Oracle Database)
- 6.2 T-SQL (Microsoft SQL Server)
- 6.3 Functions and Packages
- 6.4 Error Handling
- 6.5 Cursors
- 6.6 Dynamic SQL
- 7.1 Definition and Notation
- 7.2 Armstrong’s Axioms
- 7.3 Closure of Attribute Sets
- 7.4 Canonical Cover
- 7.5 Dependency Preservation
- 8.1 Anomalies in Databases
- 8.2 First Normal Form (1NF)
- 8.3 Second Normal Form (2NF)
- 8.4 Third Normal Form (3NF)
- 8.5 Boyce-Codd Normal Form (BCNF)
- 8.6 Fourth Normal Form (4NF)
- 8.7 Fifth Normal Form (5NF)
- 8.8 Multi-Valued Dependencies
- 8.9 Join Dependencies
- 8.10 Denormalization Strategies
- 9.1 Top-Down Design
- 9.2 Bottom-Up Design
- 9.3 View Integration
- 9.4 Schema Refinement
- 9.5 Case Studies (Banking, E-Commerce, Healthcare)
- 10.1 Definition of Transaction
- 10.2 ACID Properties
- 10.3 Transaction States
- 10.4 Schedules and Serializability
- 10.5 Conflict vs View Serializability
- 11.1 Lock-Based Protocols
- 11.2 Two-Phase Locking (2PL)
- 11.3 Deadlocks and Prevention
- 11.4 Timestamp Ordering
- 11.5 Multiversion Concurrency Control (MVCC)
- 11.6 Optimistic Concurrency Control
- 11.7 Isolation Levels (Read Committed, Serializable)
- 12.1 Failure Types
- 12.2 Log-Based Recovery
- 12.3 Checkpointing
- 12.4 Shadow Paging
- 12.5 ARIES Algorithm
- 12.6 Backup and Restore Strategies
- 13.1 File Organization
- 13.2 Heap Files
- 13.3 Sequential Files
- 13.4 Hashing
- 13.5 RAID Architecture
- 14.1 Ordered Indices
- 14.2 B-Tree
- 14.3 B+ Tree
- 14.4 Hash Indexing
- 14.5 Bitmap Indexes
- 14.6 R-Tree (Spatial Index)
- 14.7 Index Selection Strategies
- 15.1 Query Parsing
- 15.2 Query Translation
- 15.3 Logical Query Plan
- 15.4 Physical Query Plan
- 15.5 Join Algorithms
- Nested Loop Join
- Sort-Merge Join
- Hash Join
- 16.1 Cost-Based Optimization
- 16.2 Heuristic Optimization
- 16.3 Statistics and Histograms
- 16.4 Execution Plan Analysis
- 16.5 Distributed Query Optimization
- 17.1 Distributed Database Concepts
- 17.2 Fragmentation (Horizontal/Vertical)
- 17.3 Replication Strategies
- 17.4 Distributed Transactions
- 17.5 Two-Phase Commit (2PC)
- 17.6 Three-Phase Commit (3PC)
- 18.1 Limitations of RDBMS
- 18.2 CAP Theorem
- 18.3 Key-Value Stores
- 18.4 Document Databases
- MongoDB
- 18.5 Column-Family Databases
- Apache Cassandra
- 18.6 Graph Databases
- Neo4j
- 18.7 Comparison: SQL vs NoSQL
- 19.1 Distributed SQL
- 19.2 Horizontal Scalability
- 19.3 Examples
- CockroachDB
- Google Spanner
- 20.1 OLTP vs OLAP
- 20.2 Star Schema
- 20.3 Snowflake Schema
- 20.4 ETL Process
- 20.5 Data Marts
- 20.6 OLAP Operations
- 21.1 Big Data Characteristics
- 21.2 Hadoop Architecture
- Apache Hadoop
- 21.3 MapReduce
- 21.4 Spark
- Apache Spark
- 21.5 Data Lakes
- 22.1 Security Threats
- 22.2 Access Control Models
- 22.3 SQL Injection Attacks
- 22.4 Encryption (TDE)
- 22.5 Auditing and Logging
- 23.1 GDPR Overview
- 23.2 HIPAA
- 23.3 Data Masking
- 23.4 Secure Database Design
- Redis
- SAP HANA
- 25.1 Distributed Ledger Concepts
- 25.2 Smart Contracts
- 25.3 Database vs Blockchain
- 26.1 Vector Databases
- 26.2 Embedding Storage
- 26.3 Similarity Search
- 26.4 AI-Augmented Query Optimization
- 27.1 Query Profiling
- 27.2 Index Optimization
- 27.3 Caching Strategies
- 27.4 Scaling Strategies
- 28.1 CI/CD for Databases
- 28.2 Schema Migration Tools
- 28.3 Backup Automation
- 28.4 Monitoring Tools
- 29.1 Large-Scale Banking Systems
- 29.2 Social Media Architecture
- 29.3 E-Commerce Platforms
- 29.4 High-Frequency Trading Systems
- A. Relational Algebra Formal Proofs
- B. SQL Reference Manual
- C. Sample Database Projects
- D. Interview Questions (Basic → Advanced)
- E. Research Papers and Further Reading
The history of data management is a fascinating journey that parallels the evolution of computing itself. Understanding this evolution provides crucial context for appreciating modern database systems.
Pre-Mechanical Era (Before 1800s) Before computers, data management was entirely manual. Information was recorded on clay tablets, papyrus scrolls, and paper ledgers. Ancient civilizations like the Sumerians maintained grain inventories on clay tablets around 3400 BCE. The Library of Alexandria, established in the 3rd century BCE, represented one of humanity's first large-scale attempts at organized data storage, containing hundreds of thousands of papyrus scrolls cataloged by subject matter.
Mechanical Era (1800s - 1940s) The Industrial Revolution brought mechanical aids to data processing. Herman Hollerith's punched card tabulating machine, developed for the 1890 U.S. Census, could process data ten times faster than manual methods. This invention led to the founding of the Tabulating Machine Company, which eventually became IBM. Punched cards remained a dominant data storage medium until the 1970s, with each card typically holding 80 characters of data.
Early Electronic Era (1940s - 1950s) The first electronic computers, like ENIAC (1945), used vacuum tubes and had minimal data storage capabilities. Programs were wired manually, and data was entered through switches or punched cards. Magnetic tape, introduced in 1951 with the UNIVAC I, allowed sequential data access but required reading entire tapes to find specific records. This period marked the beginning of automated data processing but lacked the random access capabilities we take for granted today.
File-Based Systems (1950s - 1960s) As magnetic disks became available in the late 1950s, file-based systems emerged. Each application maintained its own files, leading to significant data redundancy. A university might have separate student files for the registrar's office, financial aid, and housing—each containing duplicate personal information. This approach worked for simple applications but created numerous problems: data inconsistency when the same information was updated in one file but not others, limited data sharing, and no standardized access methods.
Database Management Systems Emerge (1960s - 1970s) The concept of a centralized, integrated data repository gained traction in the 1960s. Charles Bachman at General Electric developed the Integrated Data Store (IDS), considered the first DBMS, using a network data model. IBM's Information Management System (IMS), developed for the Apollo space program, introduced the hierarchical model. These early systems separated data from applications and provided programmatic access methods.
Relational Revolution (1970 - Present) Edgar F. Codd's 1970 paper "A Relational Model of Data for Large Shared Data Banks" fundamentally changed database thinking. Codd proposed organizing data into tables (relations) with mathematical foundations in set theory and predicate logic. This abstraction freed users from understanding physical storage details. IBM's System R prototype (1974-1979) and UC Berkeley's INGRES project demonstrated the relational model's practicality, leading to commercial products like Oracle (1979), DB2 (1983), and later open-source systems like MySQL (1995) and PostgreSQL (1996).
Modern Era (2000s - Present) The internet explosion created new challenges: massive scale, unstructured data, and always-available requirements. Google's Bigtable (2006) and Amazon's Dynamo (2007) papers inspired the NoSQL movement. Today's landscape includes relational databases, document stores, graph databases, time-series databases, and NewSQL systems, each optimized for specific use cases.
Understanding the fundamental differences between file-based approaches and database management systems illuminates why databases became essential.
File System Approach In file-based systems, applications directly create and manipulate operating system files. A payroll application might store employee records in a fixed-length format within a flat file:
Employee Record Format (Fixed Length - 100 bytes):
- Employee ID: 6 bytes (numeric)
- Last Name: 20 bytes (text)
- First Name: 15 bytes (text)
- Department: 4 bytes (numeric)
- Salary: 9 bytes (packed decimal)
- Hire Date: 8 bytes (YYYYMMDD)
- Remaining: 38 bytes (unused/spare)
The application code must handle:
- Opening/closing files with appropriate permissions
- Positioning file pointers to read/write specific records
- Parsing raw bytes into application data structures
- Managing concurrent access (typically using file locks)
- Ensuring data integrity across related files
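To make that bookkeeping concrete, here is a minimal Python sketch that parses one 100-byte record of the layout above. The ASCII text fields and digit-encoded salary are simplifying assumptions for illustration; real mainframe files often used EBCDIC and true packed decimal.

```python
# Parse one fixed-length employee record (100 bytes) into a dict.
# Simplifying assumptions: ASCII text fields, salary stored as nine
# plain digits (real systems often used EBCDIC and packed decimal).

RECORD_LEN = 100

def parse_employee(record: bytes) -> dict:
    if len(record) != RECORD_LEN:
        raise ValueError("record must be exactly 100 bytes")
    return {
        "employee_id": int(record[0:6]),
        "last_name": record[6:26].decode("ascii").rstrip(),
        "first_name": record[26:41].decode("ascii").rstrip(),
        "department": int(record[41:45]),
        "salary": int(record[45:54]),
        "hire_date": record[54:62].decode("ascii"),  # YYYYMMDD
        # bytes 62-99 are the unused/spare region
    }

# Build a sample record with the same layout, then parse it.
raw = (b"000042"
       + b"Smith".ljust(20)
       + b"Alice".ljust(15)
       + b"0001"
       + b"000085000"
       + b"20230115"
       + b" " * 38)
emp = parse_employee(raw)
```

Every application touching this file must duplicate this layout knowledge; if the record format changes, each copy of this parsing code must be found and updated, which is exactly the coupling a DBMS catalog removes.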
Problems with File Systems:
- Data Redundancy and Inconsistency: Multiple applications maintain duplicate data. A customer's address might appear in sales, support, and billing files. When the customer moves, updating all copies is error-prone, leading to inconsistent data across the organization.
- Data Isolation: Data exists in different formats and locations. Marketing's customer file might use comma-separated values, while sales uses fixed-width format. Creating integrated reports requires complex, custom programming to reconcile these differences.
- Integrity Problems: File systems don't enforce business rules. Nothing prevents a payroll application from entering negative salary values or assigning employees to non-existent departments. Each application must implement its own validation logic, often inconsistently.
- Atomicity Issues: If a funds transfer application crashes after debiting one account but before crediting another, the system becomes inconsistent. File systems provide no built-in mechanism to ensure that operations complete entirely or not at all.
- Concurrency Control: When multiple users simultaneously access files, problems arise. Consider two travel agents booking the last seat on a flight. Without proper coordination, both might sell the same seat, leading to overbooking.
- Security Limitations: File systems provide coarse access control (read/write/execute at file level). Granting a user access to salary information typically means granting access to the entire payroll file, exposing all employees' salaries.
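The overbooking scenario can be reproduced deterministically. This toy Python sketch (the flight number and agents are invented) interleaves two read-then-write sequences and loses one of the updates:

```python
# Lost-update anomaly: two agents both book "the last seat" because
# each reads the seat count before either writes its update back.

seats = {"FL123": 1}   # one seat left on (hypothetical) flight FL123
bookings = []

# Both agents read the current count first (interleaved reads)...
agent_a_view = seats["FL123"]   # sees 1
agent_b_view = seats["FL123"]   # also sees 1

# ...then each decides a seat is available and writes independently.
if agent_a_view > 0:
    seats["FL123"] = agent_a_view - 1
    bookings.append("agent A")
if agent_b_view > 0:
    seats["FL123"] = agent_b_view - 1   # silently overwrites A's update
    bookings.append("agent B")

# Result: two tickets sold, yet the stored count only dropped by one.
```

A DBMS prevents this by making the read and the write a single atomic unit, for example through locking or an atomic conditional UPDATE.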
Database System Advantages:
- Self-Describing Nature: A DBMS contains not just data but metadata—the database catalog or dictionary stores information about data structures, constraints, and relationships. Applications query this catalog to understand data formats, enabling greater flexibility and self-documentation.
- Data Abstraction: The DBMS provides multiple views of data. Physical storage details (file locations, block sizes, index structures) are hidden from users. The conceptual schema describes the overall logical structure, while external schemas provide customized user views. A personnel manager might see employee names and departments without salary information.
- Controlled Redundancy: While some redundancy might exist for performance, the DBMS manages it. Through normalization theory and integrity constraints, the system minimizes redundancy while maintaining consistency. When redundancy is intentionally introduced (like denormalization for performance), the DBMS helps maintain synchronization.
- Data Independence: Applications become immune to changes in storage structures and access methods. When the DBA adds indexes for performance, existing applications continue working unchanged. This logical and physical data independence revolutionized application development.
- Efficient Data Access: The DBMS employs sophisticated optimization techniques. A query like "find all employees in department 42 earning more than $100,000" might be executed using indexes, hash lookups, or full table scans—the DBMS chooses the most efficient strategy based on data statistics.
- Transaction Support: ACID properties (Atomicity, Consistency, Isolation, Durability) ensure reliable data processing. The DBMS guarantees that transactions complete fully, maintain consistency, don't interfere with each other, and survive system failures.
- Concurrent Access: Sophisticated locking and multiversion concurrency control allow thousands of users to simultaneously access and modify data while maintaining consistency. The DBMS automatically manages lock granularity (row-level, page-level, table-level) based on workload characteristics.
- Security and Authorization: Fine-grained access control can restrict users to specific rows and columns. A manager might view their direct reports' salaries but not other departments'. The DBMS enforces these restrictions regardless of which application accesses the data.
- Backup and Recovery: Automated backup procedures and recovery mechanisms protect against data loss. Point-in-time recovery can restore a database to its state just before an erroneous UPDATE statement. Log-based recovery ensures no committed transactions are lost even after system crashes.
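The optimizer behavior described under "Efficient Data Access" can be observed directly. The sketch below uses SQLite (table and index names are invented) and its EXPLAIN QUERY PLAN output; other systems expose similar EXPLAIN facilities.

```python
# Observe the optimizer choosing an access path: SQLite reports a full
# scan before the index exists and an index search afterwards.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, "
             "department INTEGER, salary INTEGER)")
conn.executemany("INSERT INTO employees (department, salary) VALUES (?, ?)",
                 [(d, 50000 + d * 1000) for d in range(100)])

query = "SELECT id FROM employees WHERE department = 42 AND salary > 100000"

def plan(conn, sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in column 3.
    return " ".join(row[3] for row in
                    conn.execute("EXPLAIN QUERY PLAN " + sql))

before = plan(conn, query)                  # full table scan
conn.execute("CREATE INDEX idx_dept ON employees (department)")
after = plan(conn, query)                   # search via idx_dept
```

The application's SQL text never changes; only the plan the DBMS chooses does.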
Modern database systems exhibit several defining characteristics that distinguish them from simpler data storage mechanisms.
1. Persistence Data outlives the programs that create or access it. When an application terminates, the data remains safely stored on secondary storage. Even system failures don't compromise data durability—committed transactions survive crashes and restarts.
2. Data Integration The database serves as a central repository for multiple applications and users. Sales, inventory, accounting, and human resources all access the same integrated data store, eliminating artificial barriers between organizational functions.
3. Data Sharing Multiple users and applications simultaneously access the database. The DBMS manages this sharing, ensuring that concurrent operations don't interfere with each other. A customer service representative in New York and a warehouse clerk in London can update the same order simultaneously without conflict.
4. Controlled Redundancy While some redundancy might exist, it's carefully managed. Unlike file systems where redundancy is accidental and uncontrolled, database redundancy is intentional and controlled through mechanisms like:
- Foreign keys that maintain referential integrity
- Materialized views that automatically refresh
- Replication that synchronizes copies
5. Data Consistency Integrity constraints ensure data accuracy and correctness at all times. Constraints include:
- Domain constraints (salary must be positive)
- Key constraints (employee IDs must be unique)
- Entity integrity (primary keys cannot be null)
- Referential integrity (foreign keys must reference existing records)
- Business rules (a manager's salary must exceed their subordinates' salaries)
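Most of these constraint types translate directly into SQL DDL. A sketch using SQLite syntax follows (table names are invented; complex business rules like the manager-salary rule usually need triggers rather than simple CHECK clauses):

```python
# Declarative constraints: the DBMS rejects bad data no matter which
# application submits it. SQLite shown; DDL syntax varies by vendor.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # FK enforcement is off by default
conn.executescript("""
    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE employees (
        emp_id  INTEGER PRIMARY KEY,              -- entity integrity
        name    TEXT NOT NULL,
        salary  INTEGER CHECK (salary > 0),       -- domain constraint
        dept_id INTEGER REFERENCES departments    -- referential integrity
    );
    INSERT INTO departments VALUES (10, 'Engineering');
""")

errors = []
for stmt in [
    "INSERT INTO employees VALUES (1, 'Ada', -500, 10)",   # negative salary
    "INSERT INTO employees VALUES (2, 'Bob', 90000, 99)",  # no such dept
]:
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as exc:
        errors.append(str(exc))

conn.execute("INSERT INTO employees VALUES (3, 'Eve', 80000, 10)")  # valid
```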
6. Query Capability Users can ask ad-hoc questions without programming. A marketing analyst can query "show me customers who purchased product X but not product Y in the last 30 days" without writing custom code. The DBMS's query processor optimizes and executes such requests efficiently.
7. Data Independence Applications remain unaffected by changes in data storage and access methods. When the DBA decides to partition a large table across multiple disks, existing queries continue working. When new indexes are added for performance, applications don't need modification.
8. Security Comprehensive security mechanisms protect data from unauthorized access. Features include:
- Authentication (verifying user identity)
- Authorization (controlling access to specific data)
- Encryption (protecting data in transit and at rest)
- Auditing (tracking who accessed what and when)
9. Transaction Management The DBMS ensures reliable processing of business operations through ACID properties:
Atomicity: Transactions are all-or-nothing. If a funds transfer consists of debiting account A and crediting account B, either both operations occur or neither does. Partial execution is impossible.
Consistency: Transactions transform the database from one consistent state to another. All defined integrity constraints hold before and after transaction execution.
Isolation: Concurrent transactions appear to execute sequentially. Even with hundreds of simultaneous users, each transaction sees a consistent database state as if it were the only active transaction.
Durability: Once committed, transaction effects persist even through system failures. If the power fails immediately after confirming a funds transfer, the transfer remains complete when the system restarts.
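Atomicity in particular is easy to demonstrate. In this SQLite sketch (account names are invented), a simulated crash between the debit and the credit rolls the whole transfer back:

```python
# Atomicity: if the transfer fails after the debit, the DBMS rolls the
# debit back too, leaving balances as if the transfer never started.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (acct TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("A", 100), ("B", 50)])
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 60 "
                     "WHERE acct = 'A'")        # debit succeeds
        raise RuntimeError("simulated crash")   # crash before the credit
        # (the matching credit to account B would have run here)
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT acct, balance FROM accounts"))
```

The debit that had already executed is undone: both balances are exactly what they were before the transaction began.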
10. Scalability Database systems can grow with organizational needs. Vertical scalability adds more powerful hardware (more CPUs, memory, faster storage). Horizontal scalability distributes data across multiple servers (sharding, replication, partitioning).
Data models provide abstractions for describing data, relationships, and constraints. Different models emphasize different aspects of data management.
Hierarchical Model (1960s) IBM's IMS popularized the hierarchical model, organizing data as tree structures. Each record has a single parent (except root records) and can have multiple children. This 1:N relationship structure worked well for Bill of Materials (a product consists of subassemblies, which consist of parts) but struggled with many-to-many relationships.
Example: A university hierarchical database might structure data as:
- University (root)
- Colleges (children of University)
- Departments (children of Colleges)
- Faculty (children of Departments)
- Courses (children of Departments)
- Students (children of Courses)
Problems arise when a student takes multiple courses—the student would appear multiple times in the tree, introducing redundancy and update anomalies.
Network Model (1970s) The CODASYL (Conference on Data Systems Languages) network model extended the hierarchical approach to support many-to-many relationships using sets. Records could have multiple parents, creating a network structure. While more flexible than the hierarchical model, the network model required programmers to navigate complex pointer structures, making application development challenging.
Relational Model (1970s-Present) Codd's relational model organizes data into tables (relations) with rows (tuples) and columns (attributes). Relationships are represented through common values rather than explicit pointers. The model's mathematical foundation in set theory and predicate logic provides powerful query capabilities through relational algebra and calculus.
Advantages:
- Simple, intuitive representation
- Strong theoretical foundation
- Data independence through logical views
- Declarative query languages (SQL)
- Well-understood normalization theory
Entity-Relationship Model (1976) Peter Chen's ER model provides a conceptual design tool bridging the gap between real-world requirements and database implementation. Entities represent objects (Student, Course), attributes describe entity properties (student_name, course_credits), and relationships connect entities (Enrollment connects Students and Courses).
ER diagrams use standardized notation:
- Rectangles for entities
- Diamonds for relationships
- Ovals for attributes
- Lines with cardinality indicators (1:1, 1:N, M:N)
Object-Oriented Model (1980s-1990s) Object-oriented databases attempted to address impedance mismatch between OOP languages and relational databases. Objects encapsulate both data and behavior, supporting inheritance, polymorphism, and complex data types. While influential, OODBMS never achieved mainstream adoption outside specialized domains like CAD/CAM and telecommunications.
Object-Relational Model (1990s-Present) Modern relational databases incorporate object-oriented features: user-defined types, inheritance, methods, and complex data structures. PostgreSQL pioneered this approach, supporting array types, composite types, and custom functions. Oracle's object-relational features include object tables, REF types, and method definitions.
Semantic Models (1990s-Present) Semantic data models emphasize meaning and relationships. The Resource Description Framework (RDF) represents information as subject-predicate-object triples, enabling semantic web applications. Knowledge graphs and ontology-based models support AI and machine learning applications by capturing domain knowledge and inference rules.
NoSQL Models (2000s-Present) The NoSQL movement encompasses multiple data models:
Key-Value Stores: Simple, high-performance systems storing arbitrary data under unique keys. Redis and Amazon DynamoDB exemplify this model, supporting millions of operations per second.
Document Stores: Semi-structured data in JSON, BSON, or XML formats. MongoDB's document model allows nested structures and varying schemas within the same collection, supporting agile development and evolving requirements.
Column-Family Stores: Google BigTable and Apache Cassandra organize data by column families, optimizing for wide tables with many attributes. This model excels at time-series data, sensor readings, and analytics workloads.
Graph Databases: Neo4j and Amazon Neptune store nodes (entities) and edges (relationships), optimizing for connected data queries like social networks, recommendation engines, and fraud detection.
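To make the document model concrete, here is a toy sketch in plain Python (not a real MongoDB client): one collection holds documents with varying schemas, queried by field matching.

```python
# A toy document "collection": JSON-like records with differing schemas
# coexist in one collection and are queried by field matching.
customers = [
    {"_id": 1, "name": "Acme", "tier": "gold",
     "contacts": [{"email": "ops@acme.example"}]},
    {"_id": 2, "name": "Globex", "tier": "silver"},        # no contacts field
    {"_id": 3, "name": "Initech", "tier": "gold", "churned": True},
]

def find(collection, criteria):
    """Return documents whose fields match every key in `criteria`."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

gold = find(customers, {"tier": "gold"})
```

A relational design would force these records into one fixed schema (or several tables); the document model tolerates per-document variation at the cost of weaker schema enforcement.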
Understanding the various stakeholders in a database environment helps clarify requirements and responsibilities.
Database Administrators (DBAs) DBAs are the technical guardians of the database environment. Their responsibilities span:
Installation and Configuration: Setting up DBMS software, configuring memory parameters, storage allocation, and network settings. A DBA might configure Oracle's SGA (System Global Area) size based on available memory and workload characteristics.
Security Management: Creating user accounts, assigning roles, managing privileges, and implementing encryption. DBAs establish authentication mechanisms (passwords, LDAP, Kerberos) and authorization policies.
Backup and Recovery: Designing backup strategies (full, incremental, differential), scheduling backups, testing restore procedures, and recovering from failures. A banking DBA might implement continuous archiving to support point-in-time recovery.
Performance Monitoring: Tracking system performance, identifying bottlenecks, tuning queries, and optimizing indexes. DBAs use tools like Oracle Enterprise Manager, PostgreSQL pg_stat_statements, and custom monitoring scripts.
Capacity Planning: Forecasting growth, planning hardware upgrades, and ensuring adequate resources. DBAs analyze trends in data volume, transaction rates, and user concurrency.
Schema Management: Designing and maintaining database structures, implementing changes through controlled procedures, and ensuring referential integrity.
Database Designers Designers create the logical and physical database structure. Working with application developers and end users, they:
- Analyze requirements and create conceptual data models
- Transform conceptual models into normalized relational schemas
- Design indexes, partitioning strategies, and storage layouts
- Document design decisions and constraints
Application Developers Developers build applications that interact with databases. Their responsibilities include:
- Writing SQL queries and stored procedures
- Implementing data access layers in programming languages (Java, Python, C#)
- Managing connections and transaction boundaries
- Handling errors and exceptional conditions
- Optimizing application-database interaction
End Users End users consume database information through applications:
Naive Users: Use applications with pre-defined interfaces. An airline reservation agent enters passenger information without writing queries or understanding database structure.
Sophisticated Users: Query databases directly using SQL or reporting tools. Business analysts might write complex queries to analyze sales trends or customer behavior.
Specialized Users: Build database-backed applications. Data scientists extract training data for machine learning models. ETL developers design data integration pipelines.
System Analysts Analysts bridge business requirements and technical implementation. They:
- Gather and document user requirements
- Create use cases and business process models
- Validate database designs against business needs
- Ensure compliance with regulations and policies
Data Architects Architects focus on enterprise-wide data strategy:
- Establish data standards and best practices
- Design integration patterns across systems
- Plan data migration and consolidation strategies
- Govern data quality and master data management
The ANSI/SPARC (American National Standards Institute/Standards Planning and Requirements Committee) architecture, proposed in 1975, established a three-level framework for database systems that remains influential today.
Motivation for Three-Level Architecture Before ANSI/SPARC, database systems tightly coupled user views with physical storage. Any change to storage structures required modifying applications. The three-level architecture aimed to provide:
- Data Independence: Applications insulated from physical storage changes
- Multiple Views: Different users see customized data representations
- Centralized Control: DBMS manages all data access and integrity
The ANSI/SPARC architecture comprises three distinct levels:
External Level (User Views) The external level describes how individual users perceive the database. Each user group has its own external schema (subschema) showing only relevant data. Examples:
A payroll clerk's external view might include:
CREATE VIEW PayrollView AS
SELECT employee_id, employee_name, department, salary, bank_account
FROM employees
WHERE status = 'ACTIVE';

A project manager's external view might show:
CREATE VIEW ProjectAssignmentView AS
SELECT e.employee_name, p.project_name, a.hours_allocated,
a.start_date, a.end_date
FROM employees e, projects p, assignments a
WHERE e.employee_id = a.employee_id
AND p.project_id = a.project_id;

Benefits of external schemas:
- Security: Users see only authorized data
- Simplicity: Complex underlying structures hidden
- Customization: Data presented in most useful format
- Independence: Underlying changes don't affect views if structure preserved
Conceptual Level (Community View) The conceptual level describes the entire database structure without physical details. It represents:
- All entities, attributes, and relationships
- Integrity constraints and business rules
- Security and authorization requirements
- Data independence from storage considerations
A conceptual schema for a university might define:
CREATE TABLE students (
student_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
date_of_birth DATE,
email VARCHAR(200) UNIQUE,
enrollment_date DATE DEFAULT CURRENT_DATE
);
CREATE TABLE courses (
course_id CHAR(10) PRIMARY KEY,
title VARCHAR(200) NOT NULL,
credits INTEGER CHECK (credits BETWEEN 1 AND 6),
department VARCHAR(50)
);
CREATE TABLE enrollments (
student_id INTEGER REFERENCES students,
course_id CHAR(10) REFERENCES courses,
semester CHAR(6),
grade CHAR(2),
PRIMARY KEY (student_id, course_id, semester)
);

Internal Level (Physical Storage) The internal level describes how data is physically stored and accessed:
- File organizations (heap, sequential, hash)
- Index structures (B-trees, hash indexes)
- Storage allocations (tablespaces, data files)
- Compression and encryption methods
- Data placement across disks
Example internal specifications:
Table: students
Storage Organization: Heap file with clustering on student_id
Indexes:
- Primary B-tree index on student_id (clustered)
- Secondary hash index on email
Tablespace: USER_DATA01 on disk group DATA_DG1
Row Format: Variable length with compression
Blocksize: 8KB
Data independence—the immunity of applications to changes in data representation—is a fundamental benefit of the three-level architecture.
Physical Data Independence Applications and conceptual schemas remain unchanged when physical storage changes. Examples of physical changes that shouldn't affect applications:
- Index Changes: Adding or dropping indexes to improve performance
- File Organization: Switching from heap to hash organization
- Storage Devices: Moving from HDD to SSD storage
- Partitioning: Splitting a table across multiple disks
- Compression: Implementing or changing compression algorithms
- Reorganization: Rebuilding tables to eliminate fragmentation
Logical Data Independence External schemas (user views) remain unchanged when conceptual schema changes. Examples of logical changes that shouldn't affect external views:
- Adding New Attributes: Adding columns to tables
- Creating New Tables: Introducing new entity types
- Splitting Tables: Decomposing tables for normalization
- Merging Tables: Combining tables for performance
- Adding Relationships: Introducing new associations
Achieving logical data independence is more challenging than physical independence. View mechanisms must shield users from underlying changes through view definitions that map external to conceptual structures.
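A concrete illustration of that shielding, using SQLite (table and view names are invented): the employees table is split for normalization, and redefining one view keeps the application's query working unchanged.

```python
# Logical data independence: after the employees table is split in two,
# recreating the view lets the existing application query run unchanged.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY,
                            name TEXT, salary INTEGER);
    INSERT INTO employees VALUES (1, 'Ada', 120000), (2, 'Bob', 90000);
    CREATE VIEW staff AS SELECT emp_id, name, salary FROM employees;
""")
app_query = "SELECT name, salary FROM staff ORDER BY emp_id"
before = conn.execute(app_query).fetchall()

# Conceptual-schema change: move salaries to their own table, rebuild
# employees without the column, then redefine the view (the mapping).
conn.executescript("""
    CREATE TABLE salaries (emp_id INTEGER PRIMARY KEY, salary INTEGER);
    INSERT INTO salaries SELECT emp_id, salary FROM employees;
    CREATE TABLE employees_new (emp_id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO employees_new SELECT emp_id, name FROM employees;
    DROP VIEW staff;
    DROP TABLE employees;
    ALTER TABLE employees_new RENAME TO employees;
    CREATE VIEW staff AS
        SELECT e.emp_id, e.name, s.salary
        FROM employees e JOIN salaries s ON e.emp_id = s.emp_id;
""")
after = conn.execute(app_query).fetchall()   # same query, same result
```

The application never learns that `staff` is now a join; only the external/conceptual mapping (the view definition) changed.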
Mappings Between Levels The DBMS maintains mappings connecting the three levels:
External/Conceptual Mapping: Translates user queries on external views to operations on conceptual schema. When a payroll clerk queries the PayrollView, the DBMS maps this to appropriate conceptual-level tables.
Conceptual/Internal Mapping: Translates conceptual operations to physical storage access. When the conceptual layer requests employee records, the internal mapping determines which files, blocks, and indexes to access.
These mappings enable data independence—changes at one level require only mapping adjustments, not application modifications.
A complete database system comprises multiple interacting components:
Hardware Components:
- Processors: CPUs executing DBMS code and application queries
- Memory: RAM for buffer caches, sort areas, and program execution
- Storage: Persistent storage (HDD, SSD) for data files, logs, and backups
- Network: Interconnects allowing client-server communication
Software Components:
DBMS Software:
- Query Processor: Parses, validates, optimizes, and executes queries
- Storage Manager: Manages data files, buffer pools, and disk I/O
- Transaction Manager: Coordinates concurrent access and recovery
- Lock Manager: Implements concurrency control protocols
- Log Manager: Records changes for recovery purposes
- Catalog Manager: Maintains metadata about database objects
Supporting Software:
- Operating System: Provides process scheduling, file management, networking
- Network Software: Enables client-server communication
- Application Programs: Provide user interfaces and business logic
People Components:
- Database administrators
- Application developers
- End users
- System support staff
Procedures:
- Startup and shutdown procedures
- Backup and recovery processes
- Security audit protocols
- Performance monitoring routines
- Change management processes
Oracle Database
Oracle, first released in 1979, remains the market leader in enterprise database systems. Key features:
Multitenant Architecture: Pluggable databases allow consolidation of multiple databases into a single container database, simplifying management and resource allocation.
Real Application Clusters (RAC): Multiple instances access a single database, providing high availability and scalability. If one node fails, others continue processing.
Automatic Storage Management (ASM): Volume manager and file system optimized for database files, providing striping, mirroring, and online reconfiguration.
Advanced Security: Transparent Data Encryption (TDE), Data Redaction, and Database Vault protect sensitive information.
Oracle Exadata: Database machine integrating hardware and software for extreme performance. Smart Scan offloads query processing to storage servers.
MySQL
MySQL, acquired by Sun Microsystems in 2008 and subsequently by Oracle in 2010, dominates web applications. Features:
Storage Engine Architecture: Pluggable storage engines (InnoDB, MyISAM, Memory) allow workload-specific optimization. InnoDB provides ACID compliance with row-level locking; MyISAM offers table-level locking with full-text search.
Replication: Master-slave replication supports read scaling and high availability. Group replication provides multi-master capabilities.
Partitioning: Range, list, hash, and key partitioning distribute data for manageability and performance.
JSON Support: Native JSON data type with efficient storage and indexing of JSON documents.
PostgreSQL
PostgreSQL, the "world's most advanced open source database," originated from UC Berkeley's POSTGRES project. Features:
Extensibility: Users can define data types, operators, index methods, and functions. PostGIS extends PostgreSQL with geospatial capabilities.
MVCC Implementation: Multiversion concurrency control provides snapshot isolation without read locks.
Advanced Indexing: B-tree, Hash, GiST, SP-GiST, GIN, and BRIN indexes support diverse query patterns.
Table Partitioning: Declarative partitioning simplifies management of large tables.
Logical Replication: Publish-subscribe replication enables selective data distribution and upgrades with minimal downtime.
Microsoft SQL Server
SQL Server, first released for OS/2 in 1989, evolved into a comprehensive data platform. Features:
Integration with Microsoft Ecosystem: Tight integration with Visual Studio, .NET, Azure, and Power BI.
Columnstore Indexes: Optimize analytical workloads by storing data column-wise rather than row-wise.
In-Memory OLTP: Memory-optimized tables and natively compiled stored procedures accelerate transaction processing.
Always On Availability Groups: Provide high availability and disaster recovery with readable secondaries.
Intelligent Query Processing: Adaptive joins, batch mode, and interleaved execution optimize query performance.
Data models are abstractions that describe how data is represented, organized, and manipulated. They provide the conceptual tools for database design and implementation.
Purpose of Data Models:
- Communication: Data models facilitate communication between stakeholders—business users, analysts, and technical implementers. An entity-relationship diagram conveys meaning without requiring technical database knowledge.
- Precision: Formal modeling notation ensures precise specification of requirements, reducing ambiguity and misinterpretation.
- Documentation: Models serve as living documentation of the database structure, supporting maintenance, evolution, and knowledge transfer.
- Design Validation: Before investing in implementation, models allow validation of requirements, identification of missing elements, and discovery of inconsistencies.
- Implementation Blueprint: Models guide physical database creation, providing specifications for tables, constraints, and indexes.
Levels of Data Models:
Conceptual Data Models: High-level representations focusing on business concepts and relationships, independent of implementation details. Used primarily for requirements analysis and communication with business stakeholders.
Logical Data Models: More detailed representations specifying data structures, attributes, keys, and constraints, but still independent of specific DBMS products. Represent what the system should contain.
Physical Data Models: Implementation-specific models describing storage structures, indexes, partitioning, and performance optimizations for a particular DBMS. Represent how data will be stored and accessed.
Object-based logical models describe data at the conceptual and logical levels using concepts close to how humans perceive reality.
Entity-Relationship Model: The most widely used object-based model, describing the world in terms of entities (things) and relationships (associations).
Object-Oriented Model: Extends object-oriented programming concepts to databases. Objects encapsulate state (attributes) and behavior (methods). Classes define object types, supporting inheritance hierarchies.
Semantic Models: Emphasize the meaning of data and relationships. The NIAM (Nijssen's Information Analysis Methodology) and ORM (Object-Role Modeling) approaches focus on facts and roles rather than entities and attributes.
Record-based models structure data as fixed-format records, forming the basis for most implemented database systems.
Relational Model: Data organized as tables (relations) with rows (tuples) and columns (attributes). Relationships represented through common values rather than explicit pointers.
Network Model: Data organized as record types connected in a network structure using sets. Each record can participate in multiple set relationships, allowing many-to-many associations.
Hierarchical Model: Data organized as tree structures with parent-child relationships. Each child has exactly one parent, limiting flexibility but providing efficient access for hierarchical data.
Physical data models describe how data is actually stored and accessed on storage media.
Storage Structures:
- Heap files: Records stored without any particular order
- Sequential files: Records stored in key sequence
- Hashing: Records placed in buckets based on hash function
- Clustering: Related records stored together physically
Access Methods:
- B-Trees: Balanced tree structures supporting efficient search, insert, delete
- Hash Indexes: Direct access based on hash function
- Bitmap Indexes: Bit arrays for low-cardinality columns
- R-Trees: Spatial access methods for multidimensional data
Storage Allocation:
- Extents: Contiguous blocks allocated to database objects
- Segments: Logical storage units containing all extents for an object
- Tablespaces: Logical containers grouping related segments
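The hashing approach described above can be sketched in a few lines. This is a toy illustration with a hypothetical bucket count and records, not a real storage manager: the point is that a lookup touches only the single bucket the key hashes to.

```python
# Toy sketch of hash-based record placement (illustrative only):
# records are assigned to one of N buckets by hashing the key.
NUM_BUCKETS = 4
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_for(key):
    # A real DBMS uses a hash function chosen for uniform spread.
    return hash(key) % NUM_BUCKETS

def insert(key, record):
    buckets[bucket_for(key)].append((key, record))

def lookup(key):
    # Only one bucket is scanned, regardless of total record count.
    return [rec for k, rec in buckets[bucket_for(key)] if k == key]

insert(101, "Alice")
insert(102, "Bob")
print(lookup(101))  # ['Alice']
```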
Conceptual Design Phase:
Activities:
- Identify entities and relationships
- Determine business rules and constraints
- Create ER diagrams
- Validate with business stakeholders
Deliverable: Conceptual data model (ER diagram, business rules documentation)
Example for university database:
- Entities: Student, Professor, Course, Department, Classroom
- Relationships: Enrolls (Student-Course), Teaches (Professor-Course), Belongs (Student-Department)
- Business rules: A student must declare a major department; A course can have multiple sections
Logical Design Phase:
Activities:
- Transform conceptual model to relational schema
- Normalize relations (1NF, 2NF, 3NF, BCNF)
- Define primary keys, foreign keys, and constraints
- Specify views and access patterns
Deliverable: Relational schema with normalized tables, constraints, and views
Example transformations:
- Many-to-many relationship (Student-Course) becomes separate Enrollment table
- Multivalued attributes become separate tables
- Inheritance hierarchies mapped using various strategies
Physical Design Phase:
Activities:
- Choose storage structures and file organizations
- Design indexes based on query patterns
- Determine partitioning and clustering strategies
- Estimate storage requirements and plan capacity
Deliverable: Physical database design with DDL scripts, indexing strategy, storage configuration
Example physical decisions:
- Create clustered index on student_id for Student table
- Partition Enrollment table by semester for better manageability
- Add bitmap indexes on low-cardinality columns (gender, status)
Abstraction and generalization are fundamental cognitive tools for managing complexity in data modeling.
Abstraction focuses on essential characteristics while ignoring irrelevant details. In data modeling, abstraction allows us to represent real-world phenomena at appropriate levels of detail.
Types of abstraction:
Classification: Grouping objects with common properties into classes. The individual student "John Smith" is an instance of the class "Student."
Aggregation: Treating a collection of related objects as a higher-level object. A "Course" aggregates multiple "Enrollment" instances and a "Professor."
Generalization: Extracting common properties from multiple classes into a superclass. "Professor" and "Student" generalize to "Person."
Specialization: Defining subclasses with additional specific properties. "Person" specializes to "Employee" and "Student."
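The generalization/specialization abstractions above map naturally onto class inheritance. The following is an illustrative sketch (the class and attribute names are hypothetical, not a prescribed design):

```python
from dataclasses import dataclass

# Generalization: Student and Professor share Person's common
# attributes through a superclass.
@dataclass
class Person:
    person_id: int
    name: str

@dataclass
class Student(Person):      # specialization adds student-specific state
    major: str

@dataclass
class Professor(Person):    # specialization adds professor-specific state
    rank: str

s = Student(101, "John Smith", "CS")
assert isinstance(s, Person)   # classification: every Student is a Person
```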
Peter Chen's Entity-Relationship model, introduced in 1976, remains the most influential conceptual modeling approach.
Entities and Entity Sets:
An entity is a distinct object in the real world distinguishable from other objects. An entity set is a collection of similar entities.
Examples:
- Entity: Student with ID 12345
- Entity set: All students enrolled at the university
Strong vs Weak Entities:
Strong entities exist independently. Weak entities depend on strong entities for identification.
Example: A "Room" might be a weak entity dependent on "Building"—room numbers might repeat across buildings, requiring building identification for uniqueness.
Attributes:
Attributes describe properties of entities or relationships:
- Simple vs Composite: Simple attributes are atomic (age); composite attributes have components (address: street, city, zip)
- Single-valued vs Multivalued: Single-valued attributes have one value (birth_date); multivalued can have multiple (phone_numbers)
- Stored vs Derived: Stored attributes persist in the database (birth_date); derived attributes are computed (age from birth_date and the current date)
- Null-valued: Attributes that may not have values for all entities
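The stored-vs-derived distinction can be illustrated with a short computation of the derived attribute age from a stored birth_date. This is a sketch, not tied to any particular DBMS:

```python
from datetime import date

# Derived attribute: age is computed from the stored birth_date
# and the current date, so it is never stored itself.
def age(birth_date, today):
    years = today.year - birth_date.year
    # subtract one if the birthday hasn't occurred yet this year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

assert age(date(2000, 6, 15), date(2024, 6, 14)) == 23
assert age(date(2000, 6, 15), date(2024, 6, 15)) == 24
```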
Keys:
- Superkey: Set of attributes uniquely identifying entities
- Candidate key: Minimal superkey (no proper subset is a superkey)
- Primary key: Selected candidate key for entity identification
- Foreign key: Attribute(s) referencing primary key of another entity
Relationships and Relationship Sets:
A relationship is an association among entities. A relationship set is a collection of similar relationships.
Degree of Relationship:
- Unary (recursive): Relationship involving same entity set (e.g., "manages" between Employee and Employee)
- Binary: Relationship between two entity sets (most common)
- Ternary: Relationship among three entity sets (e.g., "prescribes" among Doctor, Patient, Medication)
Cardinality Constraints:
Specify how many instances of one entity can relate to another:
- One-to-One (1:1): Each entity in set A relates to at most one in set B
- One-to-Many (1:N): Each entity in A relates to many in B; each in B relates to at most one in A
- Many-to-Many (M:N): Entities in both sets can relate to many in the other
Participation Constraints:
- Total participation (existence dependency): Every entity in set must participate in relationship
- Partial participation: Entities may or may not participate
Example: Every student must have an advisor (total participation); not every professor advises students (partial participation)
ER Diagram Notation:
Standard Chen notation:
- Rectangles: Entity sets
- Diamonds: Relationship sets
- Ovals: Attributes
- Underlined attributes: Primary keys
- Double ovals: Multivalued attributes
- Dashed ovals: Derived attributes
- Double rectangles: Weak entity sets
- Double diamonds: Identifying relationships
- Lines connecting components with cardinality indicators
The Extended Entity-Relationship model adds semantic concepts to handle more complex modeling situations.
Specialization and Generalization:
Specialization is the process of defining subclasses of an entity type. A "Student" entity might specialize to "GraduateStudent" and "UndergraduateStudent."
Generalization is the reverse process—identifying common features among entity types to define a superclass. "Professor" and "Student" generalize to "Person."
Constraints on Specialization/Generalization:
- Disjointness constraint: Subclasses may be disjoint (an entity belongs to at most one) or overlapping (can belong to multiple)
- Completeness constraint: Total specialization means every superclass entity must belong to some subclass; partial allows entities without subclass membership
Union Types (Categories):
A union type represents a superclass formed from multiple entity types with different keys. Example: A "VehicleOwner" might be a Person or a Company.
Aggregation:
Treating a relationship as an entity for participation in other relationships. If "ProjectAssignment" (relationship between Employee and Project) needs to relate to "Budget," we aggregate the relationship into an entity.
Unified Modeling Language (UML), primarily designed for software engineering, provides class diagrams suitable for data modeling.
Class Notation:
- Classes: Rectangles with name, attributes, operations
- Attributes: Shown with visibility, type, multiplicity
- Associations: Lines between classes with role names, multiplicity
Advantages of UML:
- Integration with object-oriented design
- Support for operations (methods) not just data
- Rich constraint specification (OCL)
- Widely understood by software developers
Requirements Analysis:
A university needs a database to manage:
- Student information (personal details, enrollment, academic records)
- Course offerings (catalog, sections, schedules)
- Faculty (instructors, advisors, department affiliations)
- Registration (enrollments, grades, transcripts)
- Facilities (classrooms, buildings, equipment)
Conceptual Design:
Identify entities and relationships:
Entities:
- Person (abstract superclass)
- Student (subclass)
- Faculty (subclass)
- Staff (subclass)
- Department
- Course
- CourseSection
- Classroom
- Building
Relationships:
- Person works_for Department
- Student majors_in Department
- Faculty belongs_to Department
- Course offered_by Department
- Faculty teaches CourseSection
- Student enrolls_in CourseSection (with Grade)
- CourseSection meets_in Classroom
- Classroom located_in Building
Attributes:
Person: person_id (PK), name, birth_date, address, phone, email
Student: student_id (PK), classification, enrollment_date, GPA (derived)
Faculty: faculty_id (PK), rank, hire_date, office_location
Department: dept_code (PK), name, office_phone, budget
Course: course_id (PK), title, credits, description
CourseSection: section_id (PK), semester, year, capacity, schedule
Classroom: room_id (PK), room_number, capacity, type
Building: building_id (PK), name, address, floors
Cardinality Constraints:
- Student majors_in Department (N:1 - many students in one department)
- Faculty belongs_to Department (N:1)
- Department employs Faculty (1:N)
- Faculty teaches CourseSection (1:N - assuming one instructor per section)
- Student enrolls_in CourseSection (M:N - many students, many sections)
- CourseSection meets_in Classroom (N:1 - many sections in one classroom)
EER Extensions:
Generalization: Person is superclass of Student, Faculty, Staff (overlapping, total)
Logical Design (Relational Schema):
Transform to relational model:
CREATE TABLE persons (
person_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
birth_date DATE,
address TEXT,
phone VARCHAR(20),
email VARCHAR(100) UNIQUE,
person_type CHAR(1) CHECK (person_type IN ('S', 'F', 'T')) -- discriminator
);
CREATE TABLE students (
student_id INTEGER PRIMARY KEY REFERENCES persons(person_id),
classification VARCHAR(20),
enrollment_date DATE NOT NULL
);
CREATE TABLE faculty (
faculty_id INTEGER PRIMARY KEY REFERENCES persons(person_id),
rank VARCHAR(30),
hire_date DATE NOT NULL,
office_location VARCHAR(50)
);
CREATE TABLE departments (
dept_code CHAR(4) PRIMARY KEY,
name VARCHAR(100) NOT NULL,
office_phone VARCHAR(20),
budget DECIMAL(10,2)
);
CREATE TABLE courses (
course_id CHAR(8) PRIMARY KEY,
title VARCHAR(200) NOT NULL,
credits INTEGER CHECK (credits BETWEEN 1 AND 6),
description TEXT,
dept_code CHAR(4) REFERENCES departments
);
CREATE TABLE course_sections (
section_id INTEGER PRIMARY KEY,
course_id CHAR(8) REFERENCES courses,
semester CHAR(6) NOT NULL, -- e.g., '202401' for Spring 2024
year INTEGER NOT NULL,
capacity INTEGER CHECK (capacity > 0),
schedule VARCHAR(50), -- e.g., 'MWF 10:00-10:50'
faculty_id INTEGER REFERENCES faculty,
classroom_id INTEGER REFERENCES classrooms
);
CREATE TABLE enrollments (
student_id INTEGER REFERENCES students,
section_id INTEGER REFERENCES course_sections,
grade CHAR(2),
enrollment_date DATE DEFAULT CURRENT_DATE,
PRIMARY KEY (student_id, section_id)
);
CREATE TABLE buildings (
building_id INTEGER PRIMARY KEY,
name VARCHAR(100) NOT NULL,
address TEXT,
floors INTEGER
);
CREATE TABLE classrooms (
classroom_id INTEGER PRIMARY KEY,
room_number VARCHAR(10) NOT NULL,
capacity INTEGER,
room_type VARCHAR(30),
building_id INTEGER REFERENCES buildings
);
-- Additional relationships
CREATE TABLE majors (
student_id INTEGER REFERENCES students,
dept_code CHAR(4) REFERENCES departments,
declared_date DATE,
PRIMARY KEY (student_id, dept_code)
);
CREATE TABLE department_employees (
person_id INTEGER REFERENCES persons,
dept_code CHAR(4) REFERENCES departments,
start_date DATE,
end_date DATE,
role VARCHAR(50),
PRIMARY KEY (person_id, dept_code, start_date)
);
Physical Design:
Consider indexing and performance:
-- Clustered index on primary keys (automatically in many DBMS)
-- Additional indexes for foreign keys
CREATE INDEX idx_enrollments_student ON enrollments(student_id);
CREATE INDEX idx_enrollments_section ON enrollments(section_id);
CREATE INDEX idx_sections_course ON course_sections(course_id);
CREATE INDEX idx_sections_faculty ON course_sections(faculty_id);
CREATE INDEX idx_sections_classroom ON course_sections(classroom_id);
-- Index for search by name
CREATE INDEX idx_persons_name ON persons(name);
-- Composite index for semester lookups
CREATE INDEX idx_sections_semester ON course_sections(semester, year);
-- Partitioning strategy
-- Partition enrollments by year (assuming semester contains year)
-- Partition course_sections by semester
-- Tablespace allocation
-- Active data on fast SSD storage
-- Historical data on slower HDD storage
This design provides a foundation for a complete university information system, supporting registration, grade recording, scheduling, and reporting functions.
The relational model, proposed by E.F. Codd in 1970, organizes data into relations (tables) with a rigorous mathematical foundation. This structure provides both simplicity for users and formal precision for system implementers.
Mathematical Foundation:
A relation is defined as a subset of the Cartesian product of a list of domains. Formally, if D₁, D₂, ..., Dₙ are domains (sets of values), then a relation R is a set of n-tuples (d₁, d₂, ..., dₙ) such that d₁ ∈ D₁, d₂ ∈ D₂, ..., dₙ ∈ Dₙ.
This mathematical definition ensures:
- Each relation has a fixed arity (number of attributes)
- Tuples are unique (no duplicate rows)
- Attribute values come from specific domains
Relation Schema vs Relation Instance:
A relation schema describes the structure: relation name, attribute names, and domains. Example:
Student(sid: integer, sname: string, major: string, credits: integer)
A relation instance is the actual data at a particular moment—a set of tuples conforming to the schema. Instances change over time as data is inserted, updated, or deleted.
Properties of Relations:
- Order Irrelevance: The ordering of tuples is immaterial. A relation is a set, not a list.
- No Duplicate Tuples: As a set, a relation cannot contain identical tuples.
- Attribute Atomicity: Attribute values must be atomic (indivisible) with respect to the relational model. This is the first normal form condition.
- Tuple Uniqueness: Each tuple must be uniquely identifiable, typically through keys.
Understanding the precise terminology of the relational model is essential.
Relation (Table): A relation is a named, two-dimensional table representing an entity set or relationship set. The term "table" is commonly used, though strictly speaking, SQL tables may have ordered rows and duplicate rows, violating the pure relational model.
Tuple (Row): A tuple is an ordered set of attribute values, representing one entity or relationship instance. The order of values corresponds to the attribute order in the relation schema.
Attribute (Column): An attribute is a named column of a relation, representing a property of the entity or relationship. Each attribute has a domain specifying permissible values.
Degree (Arity): The number of attributes in a relation. A relation of degree 1 is unary, degree 2 binary, degree 3 ternary, and so on.
Cardinality: The number of tuples in a relation instance. Cardinality changes as data is inserted or deleted.
Example:
Relation: Student (degree = 4)
Attributes: sid (integer), sname (string), major (string), credits (integer)
Instance (cardinality = 3):
sid | sname | major            | credits
----|-------|------------------|--------
101 | Alice | Computer Science | 45
102 | Bob   | Mathematics      | 30
103 | Carol | Physics          | 52
Domains:
A domain is a set of atomic values of a particular type. Domains specify the permissible values for attributes and provide the first level of data integrity.
Examples of domains:
- Integer: all integers within specified range
- VARCHAR(20): character strings up to 20 characters
- DATE: valid calendar dates
- Boolean: TRUE or FALSE
- ENUM('Freshman', 'Sophomore', 'Junior', 'Senior'): enumerated values
Domains can have associated constraints:
- NOT NULL: values must be provided
- CHECK: additional restrictions (e.g., age >= 0)
- DEFAULT: value used if none specified
Integrity Constraints:
Integrity constraints are rules that must hold for all instances of the database. They ensure data correctness and consistency.
Domain Constraints: Ensure attribute values come from proper domain.
Key Constraints: Ensure tuples are uniquely identifiable.
Entity Integrity: Primary key attributes cannot be null.
Referential Integrity: Foreign key values must match existing primary key values or be null.
General Constraints: Arbitrary business rules (e.g., salary must not exceed manager's salary).
Keys are fundamental to the relational model, providing tuple identification and relationship representation.
Superkey: A set of attributes that uniquely identifies tuples in a relation. Any superset of a superkey is also a superkey.
Example: In Student relation, {sid} is a superkey. So is {sid, sname}—though unnecessarily large.
Candidate Key: A minimal superkey—no proper subset is a superkey. A relation may have multiple candidate keys.
Example: In Student relation, if email is also unique, both {sid} and {email} are candidate keys.
Primary Key: The candidate key chosen for tuple identification. Primary keys:
- Cannot contain null values
- Should be stable (rarely change)
- Should be minimal (single attribute when possible)
- Are often indexed for fast access
Foreign Key: An attribute (or set of attributes) in one relation that references the primary key of another relation. Foreign keys represent relationships between relations.
Example:
Enrollment(sid, cid, grade)
-- sid is foreign key referencing Student(sid)
-- cid is foreign key referencing Course(cid)
Alternate Key: Candidate keys not chosen as primary key.
Composite Key: A key consisting of multiple attributes.
Natural Key vs Surrogate Key:
- Natural key: Uses existing attributes (e.g., Social Security Number)
- Surrogate key: Artificially generated (e.g., auto-increment integer)
Surrogate keys are often preferred because they are stable, simple, and independent of business changes.
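A quick SQLite sketch of a surrogate key alongside a natural candidate key (the schema is illustrative): the generated sid values are stable and independent of the business attribute email.

```python
import sqlite3

# In SQLite, INTEGER PRIMARY KEY auto-generates surrogate values
# when the column is omitted from an INSERT.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    sid   INTEGER PRIMARY KEY,    -- surrogate key (system-generated)
    email TEXT UNIQUE NOT NULL    -- natural candidate key
)""")
conn.execute("INSERT INTO students (email) VALUES ('alice@example.edu')")
conn.execute("INSERT INTO students (email) VALUES ('bob@example.edu')")

ids = [r[0] for r in conn.execute("SELECT sid FROM students ORDER BY sid")]
print(ids)  # [1, 2]
```

If a student later changes email address, the surrogate sid, and every foreign key referencing it, stays unchanged.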
Entity Integrity:
The entity integrity rule states that no primary key attribute can accept null values. This ensures every tuple is uniquely identifiable.
Example violation:
INSERT INTO Student(sid, sname, major, credits)
VALUES (NULL, 'David', 'History', 24); -- Invalid: primary key is null
Referential Integrity:
The referential integrity rule ensures that foreign key values either:
- Match a primary key value in the referenced relation, or
- Are completely null (if the foreign key allows nulls)
Example violation:
INSERT INTO Enrollment(sid, cid, grade)
VALUES (999, 'CS101', 'A'); -- Invalid if no student with sid=999 exists
Referential Integrity Actions:
When referenced data changes, referential integrity defines how to respond:
- ON DELETE CASCADE: Delete referencing tuples when the referenced tuple is deleted
- ON DELETE SET NULL: Set the foreign key to null when the referenced tuple is deleted
- ON DELETE SET DEFAULT: Set the foreign key to its default value
- ON DELETE RESTRICT/NO ACTION: Prevent deletion while references exist
- ON UPDATE CASCADE: Propagate key changes to foreign keys
- ON UPDATE SET NULL/SET DEFAULT: Analogous to the delete options
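The CASCADE option can be demonstrated in SQLite with illustrative tables shaped like the chapter's Student/Enrollment example. Note that SQLite enforces foreign keys only when the pragma is enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite: FK enforcement is opt-in
conn.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE enrollment (
    sid INTEGER REFERENCES student(sid) ON DELETE CASCADE,
    cid TEXT,
    PRIMARY KEY (sid, cid)
)""")
conn.execute("INSERT INTO student VALUES (101)")
conn.execute("INSERT INTO enrollment VALUES (101, 'CS101')")

# Deleting the referenced student cascades to the enrollment row.
conn.execute("DELETE FROM student WHERE sid = 101")
remaining = conn.execute("SELECT COUNT(*) FROM enrollment").fetchone()[0]
print(remaining)  # 0
```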
General Constraints:
General constraints enforce business rules beyond key and referential integrity.
Attribute-based CHECK constraints:
CREATE TABLE Employee (
emp_id INTEGER PRIMARY KEY,
salary DECIMAL(10,2) CHECK (salary >= 0),
birth_date DATE CHECK (birth_date < CURRENT_DATE)
);
Tuple-based CHECK constraints:
CREATE TABLE ProjectAssignment (
emp_id INTEGER REFERENCES Employee,
project_id INTEGER REFERENCES Project,
hours_allocated DECIMAL(5,2),
start_date DATE,
end_date DATE,
CHECK (start_date <= end_date),
CHECK (hours_allocated > 0)
);
Assertions (table-independent constraints):
CREATE ASSERTION salary_ratio CHECK (
NOT EXISTS (
SELECT * FROM Employee e1, Employee e2
WHERE e1.department = e2.department
AND e1.salary > e2.salary * 10
)
);
Relational algebra provides a formal foundation for querying relational databases. It consists of operations that take relations as input and produce new relations as output—the closure property enables composition of operations.
Fundamental Operations:
Selection (σ): Selects tuples satisfying a condition. σ_condition(R) returns all tuples in R that satisfy the condition.
Example: σ_major='CS'(Student) returns all CS students.
Projection (π): Selects specified attributes, removing duplicates. π_attributes(R) returns a relation with only the named attributes.
Example: π_sname,major(Student) returns student names and majors.
Union (∪): Returns tuples in either of two union-compatible relations. R ∪ S requires R and S have same attributes and domains.
Set Difference (−): Returns tuples in first relation but not second. R − S requires union compatibility.
Cartesian Product (×): Combines every tuple of first relation with every tuple of second. R × S produces tuples combining attributes of both relations.
Rename (ρ): Changes relation or attribute names. ρ_newname(R) renames the relation; ρ_newname(A₁, …, Aₙ)(R) also renames its attributes.
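To illustrate the closure property, the fundamental operations can be sketched over relations modeled as collections of dictionaries; because each operation returns a relation, they compose freely. The data and names below are illustrative.

```python
# Toy relation: a list of tuples represented as dicts.
student = [
    {"sid": 101, "sname": "Alice", "major": "CS"},
    {"sid": 102, "sname": "Bob",   "major": "Math"},
]

def select(rel, pred):
    # σ_pred(rel): keep tuples satisfying the predicate
    return [t for t in rel if pred(t)]

def project(rel, attrs):
    # π_attrs(rel): keep named attributes, removing duplicates
    seen = []
    for t in rel:
        row = {a: t[a] for a in attrs}
        if row not in seen:
            seen.append(row)
    return seen

# Composition, thanks to closure: π_sname(σ_major='CS'(Student))
result = project(select(student, lambda t: t["major"] == "CS"), ["sname"])
print(result)  # [{'sname': 'Alice'}]
```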
Additional Operations:
Intersection (∩): Tuples in both relations (derived from difference). R ∩ S = R − (R − S)
Join Operations:
Theta Join (⋈_θ): Cartesian product followed by selection. R ⋈_θ S = σ_θ(R × S)
Equijoin: Theta join with equality condition.
Natural Join (⋈): Equijoin over all common attributes, with duplicate columns removed.
Outer Joins: Preserve tuples even when no match exists.
- Left outer join (⟕): Preserves all left relation tuples
- Right outer join (⟖): Preserves all right relation tuples
- Full outer join (⟗): Preserves all tuples from both
Division (÷): Useful for "all" queries (e.g., students who take all courses). R ÷ S finds tuples in R that combine with all tuples in S.
Example Queries:
Given the schema:
Student(sid, sname, major)
Course(cid, ctitle, credits)
Enrollment(sid, cid, grade)
Query 1: Find names of CS students
π_sname(σ_major='CS'(Student))
Query 2: Find IDs of students enrolled in the 'Database' course
π_Enrollment.sid(σ_ctitle='Database'(Course ⋈ Enrollment))
Query 3: Find names of students who got an 'A' in any course
π_sname(Student ⋈ σ_grade='A'(Enrollment))
Query 4: Find students who take all courses
π_sid,sname(Student ⋈ (π_sid,cid(Enrollment) ÷ π_cid(Course)))
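Query 4 hinges on division, which can be sketched directly over sets of tuples. The data here is a toy example: student 101 is enrolled in every course, student 102 is not.

```python
# R(sid, cid) ÷ S(cid): the sids paired with every cid in S.
enrollment = {(101, 'CS101'), (101, 'CS102'), (102, 'CS101')}
courses = {'CS101', 'CS102'}

def divide(r, s):
    candidates = {sid for sid, _ in r}
    # sid qualifies iff its set of cids is a superset of s
    return {sid for sid in candidates
            if s <= {cid for sid2, cid in r if sid2 == sid}}

print(divide(enrollment, courses))  # {101}
```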
Relational calculus provides a declarative alternative to relational algebra—specifying what to retrieve rather than how to retrieve it.
Tuple Relational Calculus (TRC):
TRC expressions have form: { t | P(t) } where t is a tuple variable and P is a formula.
Example TRC queries:
Query 1: Find all students { s | Student(s) }
Query 2: Find names of CS students { t.sname | Student(t) AND t.major = 'CS' }
Query 3: Find students enrolled in 'CS101' { s | Student(s) AND ∃ e (Enrollment(e) AND e.sid = s.sid AND e.cid = 'CS101') }
Query 4: Find students who take all courses { s | Student(s) AND ∀ c (Course(c) ⇒ ∃ e (Enrollment(e) AND e.sid = s.sid AND e.cid = c.cid)) }
Domain Relational Calculus (DRC):
DRC uses domain variables instead of tuple variables. Expressions have form: { <x₁, ..., xₙ> | P(x₁, ..., xₙ) }
Example DRC queries:
Query 1: Find all student IDs and names { <s, n> | ∃ m, c (Student(s, n, m, c)) }
Query 2: Find names of CS students { <n> | ∃ s, m, c (Student(s, n, m, c) AND m = 'CS') }
Query 3: Find students enrolled in 'CS101' { <s, n> | ∃ m, c (Student(s, n, m, c)) AND ∃ g (Enrollment(s, 'CS101', g)) }
Safety of Expressions:
Both TRC and DRC must ensure expressions produce finite results. Unsafe expressions (e.g., { t | NOT Student(t) }) would generate infinite results and are disallowed.
Views are virtual relations defined by queries, providing customized perspectives on the database.
Types of Views:
Virtual Views: Stored definition only; materialized when queried.
Materialized Views: Physically stored and maintained; faster access but requires maintenance.
View Definition:
CREATE VIEW CSstudents AS
SELECT sid, sname, credits
FROM Student
WHERE major = 'Computer Science';
CREATE VIEW StudentGrades AS
SELECT s.sid, s.sname, c.cid, c.ctitle, e.grade
FROM Student s, Enrollment e, Course c
WHERE s.sid = e.sid AND e.cid = c.cid;
Advantages of Views:
- Security: Hide sensitive columns (e.g., salary)
- Simplicity: Present simplified, relevant data to users
- Consistency: Provide uniform interface despite schema changes
- Integrity: Enforce additional constraints through WITH CHECK OPTION
Updatable Views:
Not all views support updates. Conditions for updatability:
- View based on single base table
- View includes all NOT NULL columns without defaults
- No aggregation, DISTINCT, GROUP BY, or set operations
- No subqueries in SELECT list
WITH CHECK OPTION:
Ensures modifications through view maintain view membership:
CREATE VIEW SeniorStudents AS
SELECT * FROM Student WHERE classification = 'Senior'
WITH CHECK OPTION;
-- Prevents updating classification to something other than 'Senior'
Null represents missing or unknown information—a concept that complicates relational querying.
Interpretations of Null:
- Value unknown: Employee's birth date not recorded
- Value not applicable: Car's number of cylinders for a bicycle
- Value withheld: Employee's salary confidential
Three-Valued Logic:
SQL uses three truth values: TRUE, FALSE, UNKNOWN.
Truth Tables:
AND:
- TRUE AND UNKNOWN = UNKNOWN
- FALSE AND UNKNOWN = FALSE
- UNKNOWN AND UNKNOWN = UNKNOWN
OR:
- TRUE OR UNKNOWN = TRUE
- FALSE OR UNKNOWN = UNKNOWN
- UNKNOWN OR UNKNOWN = UNKNOWN
NOT:
- NOT UNKNOWN = UNKNOWN
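These truth tables can be checked directly, because SQL surfaces UNKNOWN as NULL in query results. A minimal sketch using Python's sqlite3 (expressions only, no tables needed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
q = lambda expr: conn.execute(f"SELECT {expr}").fetchone()[0]

# UNKNOWN comes back as NULL, i.e. Python None.
print(q("1 AND NULL"))   # None -> TRUE AND UNKNOWN = UNKNOWN
print(q("0 AND NULL"))   # 0    -> FALSE AND UNKNOWN = FALSE
print(q("1 OR NULL"))    # 1    -> TRUE OR UNKNOWN = TRUE
print(q("0 OR NULL"))    # None -> FALSE OR UNKNOWN = UNKNOWN
print(q("NOT NULL"))     # None -> NOT UNKNOWN = UNKNOWN
print(q("NULL = NULL"))  # None -> any comparison with NULL is UNKNOWN
print(q("NULL IS NULL")) # 1    -> IS NULL is a true/false test, never UNKNOWN
```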
Comparison with Null:
Any comparison involving null yields UNKNOWN:
- 5 = NULL → UNKNOWN
- NULL = NULL → UNKNOWN
- NULL IS NULL → TRUE (special test)
Query Behavior:
WHERE clause returns tuples where condition is TRUE (UNKNOWN tuples excluded).
Example:
SELECT * FROM Employee WHERE salary > 50000;
-- Employees with null salary are NOT returned
Handling Nulls:
Special operators and functions:
- IS NULL / IS NOT NULL
- COALESCE(value1, value2, ...) returns first non-null
- NULLIF(expr1, expr2) returns null if equal
- NVL/IFNULL (DBMS-specific)
Impact on Aggregates:
- COUNT(*) includes all rows
- COUNT(column) excludes nulls
- SUM, AVG ignore nulls
- GROUP BY treats nulls as a single group
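The aggregate rules above are worth verifying by hand; a short sketch using Python's sqlite3 with an invented Employee table:

```python
import sqlite3

# Invented toy table: one employee's salary is unknown (NULL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (name TEXT, salary INTEGER);
INSERT INTO Employee VALUES ('Ann', 60000), ('Ben', NULL), ('Cy', 40000);
""")

# WHERE keeps only rows where the condition is TRUE: Ben's UNKNOWN row drops out.
print(conn.execute(
    "SELECT COUNT(*) FROM Employee WHERE salary > 50000").fetchone()[0])  # 1

# COUNT(*) counts rows; COUNT(column) skips NULLs; SUM/AVG ignore NULLs.
row = conn.execute(
    "SELECT COUNT(*), COUNT(salary), SUM(salary), AVG(salary) FROM Employee"
).fetchone()
print(row)  # (3, 2, 100000, 50000.0)
```

Note that AVG divides by the count of non-null values (2), not the row count (3), which is exactly the behavior the bullet list describes.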
SQL (Structured Query Language) originated from IBM's System R project in the 1970s, based on Codd's relational model. Originally called SEQUEL (Structured English Query Language), the name was shortened to SQL due to trademark issues.
Evolution of SQL Standards:
SQL-86 (SQL1): First ANSI standard, establishing basic DDL and DML.
SQL-89 (SQL1.5): Minor revisions, added integrity constraints.
SQL-92 (SQL2): Major revision, substantial expansion:
- New data types (DATE, TIME, TIMESTAMP)
- Enhanced joins (NATURAL JOIN, OUTER JOIN)
- Set operations (UNION, INTERSECT, EXCEPT)
- Schema manipulation statements
- Dynamic SQL
SQL:1999 (SQL3): Major extensions:
- Object-relational features (UDT, references)
- Recursive queries (WITH RECURSIVE)
- Triggers
- OLAP functions
- Java integration
SQL:2003: Introduced:
- XML-related features
- Window functions
- Sequences
- Identity columns
- MERGE statement
SQL:2006: Enhanced XML support.
SQL:2008: Added:
- TRUNCATE statement
- FETCH FIRST clause
- Compound statements
SQL:2011: Added:
- Temporal databases (periods, time-travel queries)
- Enhanced window functions
SQL:2016: Added:
- JSON support
- Polymorphic table functions
- Row pattern matching
DDL defines and modifies database structures.
CREATE Statements:
Create Database:
CREATE DATABASE university
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Create Table:
CREATE TABLE students (
student_id INTEGER AUTO_INCREMENT,
student_name VARCHAR(100) NOT NULL,
date_of_birth DATE,
email VARCHAR(200) UNIQUE,
enrollment_date DATE DEFAULT CURRENT_DATE,
major_code CHAR(4),
credits_completed INTEGER DEFAULT 0 CHECK (credits_completed >= 0),
status ENUM('active', 'graduated', 'suspended') DEFAULT 'active',
PRIMARY KEY (student_id),
FOREIGN KEY (major_code) REFERENCES departments(dept_code)
) ENGINE=InnoDB;
Create Schema:
CREATE SCHEMA registration AUTHORIZATION dbadmin;
ALTER Statements:
Add Column:
ALTER TABLE students
ADD COLUMN phone_number VARCHAR(20) AFTER email;
Modify Column:
ALTER TABLE students
MODIFY COLUMN student_name VARCHAR(150) NOT NULL;
Drop Column:
ALTER TABLE students
DROP COLUMN phone_number;
Add Constraint:
ALTER TABLE students
ADD CONSTRAINT chk_credits CHECK (credits_completed <= 150);
DROP Statements:
DROP TABLE students; -- Removes table and data
DROP TABLE students CASCADE CONSTRAINTS; -- Drops dependent objects
DROP DATABASE university;
DROP SCHEMA registration RESTRICT; -- Fails if schema contains objects
TRUNCATE:
TRUNCATE TABLE temp_enrollments; -- Fast removal of all rows
DML manipulates data within database structures.
INSERT:
Insert single row:
INSERT INTO students (student_id, student_name, email, major_code)
VALUES (1001, 'Alice Johnson', 'alice@university.edu', 'CS');
Insert with defaults:
INSERT INTO students (student_name, email, major_code)
VALUES ('Bob Smith', 'bob@university.edu', 'MATH');
-- student_id auto-generates, enrollment_date defaults to today
Insert multiple rows:
INSERT INTO students (student_name, email, major_code) VALUES
('Carol White', 'carol@university.edu', 'PHYS'),
('David Brown', 'david@university.edu', 'CS'),
('Eva Green', 'eva@university.edu', 'MATH');
Insert from query:
INSERT INTO honors_students (student_id, student_name, gpa)
SELECT s.student_id, s.student_name,
AVG(e.grade_points) as gpa
FROM students s
JOIN enrollments e ON s.student_id = e.student_id
GROUP BY s.student_id, s.student_name
HAVING AVG(e.grade_points) >= 3.5;
UPDATE:
Update with subquery condition:
UPDATE students
SET status = 'graduated'
WHERE student_id IN (SELECT student_id FROM graduates_2023);
Update with expression:
UPDATE products
SET price = price * 1.10
WHERE category = 'electronics';
Update with joins:
UPDATE students s
JOIN enrollments e ON s.student_id = e.student_id
SET s.credits_completed = s.credits_completed + e.credits_earned
WHERE e.semester = '202401' AND e.grade != 'F';
DELETE:
Delete specific rows:
DELETE FROM students
WHERE status = 'graduated' AND graduation_date < '2020-01-01';
Delete all rows:
DELETE FROM temp_log; -- Slow for large tables
TRUNCATE TABLE temp_log; -- Fast, but typically cannot be rolled back
MERGE (Upsert):
MERGE INTO products_target t
USING products_source s ON t.product_id = s.product_id
WHEN MATCHED THEN
UPDATE SET t.price = s.price, t.stock = s.stock
WHEN NOT MATCHED THEN
INSERT (product_id, product_name, price, stock)
VALUES (s.product_id, s.product_name, s.price, s.stock);
DCL manages permissions and access control.
GRANT:
-- Grant specific privileges
GRANT SELECT, INSERT ON students TO registrar_user;
-- Grant all privileges
GRANT ALL PRIVILEGES ON university.* TO dbadmin@localhost;
-- Grant with grant option (can pass on privileges)
GRANT SELECT ON courses TO instructor WITH GRANT OPTION;
-- Grant at column level
GRANT UPDATE (grade) ON enrollments TO instructors;
REVOKE:
-- Revoke specific privileges
REVOKE INSERT ON students FROM registrar_user;
-- Revoke all privileges
REVOKE ALL PRIVILEGES ON university.* FROM dbadmin@localhost;
-- Revoke with cascade/restrict
REVOKE SELECT ON courses FROM instructor CASCADE;
The SELECT statement forms the foundation of data retrieval.
Basic Syntax:
SELECT [DISTINCT] select_list
FROM table_list
[WHERE search_condition]
[GROUP BY group_by_list]
[HAVING search_condition]
[ORDER BY order_by_list]
[LIMIT row_count];
Simple SELECT:
SELECT * FROM students; -- All columns
SELECT student_name, email FROM students; -- Specific columns
SELECT DISTINCT major_code FROM students; -- Unique values
SELECT student_name AS name, email AS contact -- Column aliases
FROM students;
SELECT with Expressions:
SELECT student_name,
TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE()) AS age,
CONCAT(last_name, ', ', first_name) AS full_name
FROM students;
SELECT with Literals:
SELECT student_name,
'Active' AS status,
CURRENT_DATE AS query_date
FROM students
WHERE status = 'active';
WHERE clause filters rows based on conditions.
Comparison Operators:
-- Numeric comparisons
SELECT * FROM products WHERE price > 100;
SELECT * FROM products WHERE price BETWEEN 50 AND 100;
SELECT * FROM products WHERE price IN (19.99, 29.99, 39.99);
-- String comparisons
SELECT * FROM students WHERE student_name LIKE 'A%'; -- Starts with A
SELECT * FROM students WHERE student_name LIKE '%son'; -- Ends with 'son'
SELECT * FROM students WHERE email LIKE '%@university.edu';
-- Date comparisons
SELECT * FROM orders WHERE order_date >= '2024-01-01';
SELECT * FROM students WHERE YEAR(enrollment_date) = 2024;
Logical Operators:
-- AND: both conditions true
SELECT * FROM students
WHERE major_code = 'CS' AND credits_completed > 60;
-- OR: either condition true
SELECT * FROM students
WHERE major_code = 'CS' OR major_code = 'MATH';
-- NOT: negate condition
SELECT * FROM students
WHERE NOT (status = 'graduated');
NULL Handling:
-- Check for NULL
SELECT * FROM students WHERE graduation_date IS NULL;
-- Check for NOT NULL
SELECT * FROM students WHERE phone_number IS NOT NULL;
-- COALESCE for default values
SELECT student_name,
COALESCE(phone_number, 'No phone') AS contact
FROM students;
ORDER BY (Sorting):
-- Ascending order (default)
SELECT student_name, credits_completed
FROM students
ORDER BY credits_completed;
-- Descending order
SELECT student_name, credits_completed
FROM students
ORDER BY credits_completed DESC;
-- Multiple sort columns
SELECT student_name, major_code, credits_completed
FROM students
ORDER BY major_code ASC, credits_completed DESC;
-- Sorting with expressions
SELECT student_name, TIMESTAMPDIFF(YEAR, date_of_birth, CURDATE()) AS age
FROM students
ORDER BY age DESC;
GROUP BY (Aggregation):
-- Basic grouping
SELECT major_code, COUNT(*) AS student_count
FROM students
GROUP BY major_code;
-- Multiple grouping columns
SELECT major_code, status, COUNT(*) AS count
FROM students
GROUP BY major_code, status;
-- Group by expression
SELECT YEAR(enrollment_date) AS enrollment_year,
COUNT(*) AS new_students
FROM students
GROUP BY YEAR(enrollment_date);
HAVING filters groups, just as WHERE filters rows.
-- Filter groups
SELECT major_code, AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
HAVING AVG(credits_completed) > 30;
-- Multiple conditions
SELECT major_code,
COUNT(*) AS student_count,
AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
HAVING COUNT(*) > 10 AND AVG(credits_completed) BETWEEN 30 AND 60;
-- HAVING vs WHERE
SELECT major_code, COUNT(*) AS student_count
FROM students
WHERE status = 'active' -- Filter rows before grouping
GROUP BY major_code
HAVING COUNT(*) > 5; -- Filter groups after aggregation
Aggregate Functions:
-- COUNT: number of rows
SELECT COUNT(*) FROM students; -- Total rows
SELECT COUNT(email) FROM students; -- Non-null emails
SELECT COUNT(DISTINCT major_code) FROM students; -- Unique majors
-- SUM: total of numeric column
SELECT SUM(credits_completed) FROM students;
-- AVG: average of numeric column
SELECT AVG(credits_completed) FROM students;
-- MIN/MAX: minimum/maximum values
SELECT MIN(credits_completed), MAX(credits_completed) FROM students;
-- Multiple aggregates
SELECT
COUNT(*) AS total_students,
AVG(credits_completed) AS avg_credits,
MIN(credits_completed) AS min_credits,
MAX(credits_completed) AS max_credits
FROM students;
String Functions:
-- Concatenation
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM students;
-- Substring
SELECT SUBSTRING(email, 1, LOCATE('@', email) - 1) AS username FROM students;
-- Case conversion
SELECT UPPER(last_name), LOWER(first_name) FROM students;
-- Length
SELECT student_name, LENGTH(student_name) AS name_length FROM students;
-- Trimming
SELECT TRIM(' ' FROM student_name) FROM students;
-- Replacement
SELECT REPLACE(email, '@university.edu', '@alumni.edu') FROM students;
Date/Time Functions:
-- Current date/time
SELECT CURRENT_DATE, CURRENT_TIME, CURRENT_TIMESTAMP;
-- Date parts
SELECT
student_name,
YEAR(date_of_birth) AS birth_year,
MONTH(date_of_birth) AS birth_month,
DAY(date_of_birth) AS birth_day
FROM students;
-- Date arithmetic
SELECT
student_name,
DATEDIFF(CURRENT_DATE, enrollment_date) AS days_enrolled,
DATE_ADD(enrollment_date, INTERVAL 4 YEAR) AS expected_graduation
FROM students;
-- Date formatting
SELECT
student_name,
DATE_FORMAT(enrollment_date, '%M %d, %Y') AS formatted_date
FROM students;
Numeric Functions:
-- Rounding
SELECT price, ROUND(price, 2) FROM products;
SELECT price, CEILING(price), FLOOR(price) FROM products;
-- Absolute value
SELECT ABS(balance) FROM accounts;
-- Random
SELECT RAND() FROM DUAL; -- Random number between 0 and 1
-- Mathematical functions
SELECT
price,
POWER(price, 2) AS price_squared,
SQRT(price) AS price_sqrt
FROM products;
Conditional Functions:
-- CASE statement
SELECT
student_name,
credits_completed,
CASE
WHEN credits_completed >= 120 THEN 'Senior'
WHEN credits_completed >= 90 THEN 'Junior'
WHEN credits_completed >= 60 THEN 'Sophomore'
WHEN credits_completed >= 30 THEN 'Freshman'
ELSE 'First Term'
END AS class_level
FROM students;
-- IF function (MySQL)
SELECT
student_name,
IF(status = 'active', 'Currently enrolled', 'Not active') AS enrollment_status
FROM students;
-- COALESCE (first non-null)
SELECT
student_name,
COALESCE(phone_number, email, 'No contact') AS contact
FROM students;
-- NULLIF (null if equal)
SELECT
student_name,
NULLIF(status, 'active') AS non_active_status
FROM students;
Column-Level Constraints:
CREATE TABLE employees (
emp_id INTEGER PRIMARY KEY, -- Primary key
ssn CHAR(11) UNIQUE, -- Unique constraint
first_name VARCHAR(50) NOT NULL, -- Not null
last_name VARCHAR(50) NOT NULL,
hire_date DATE DEFAULT CURRENT_DATE, -- Default value
salary DECIMAL(10,2) CHECK (salary > 0), -- Check constraint
dept_id INTEGER REFERENCES departments(dept_id) -- Foreign key
);
Table-Level Constraints:
CREATE TABLE enrollments (
student_id INTEGER,
course_id CHAR(8),
semester CHAR(6),
grade CHAR(2),
enrollment_date DATE DEFAULT CURRENT_DATE,
-- Table-level primary key (composite)
CONSTRAINT pk_enrollment PRIMARY KEY (student_id, course_id, semester),
-- Table-level foreign keys
CONSTRAINT fk_enroll_student FOREIGN KEY (student_id)
REFERENCES students(student_id) ON DELETE CASCADE,
CONSTRAINT fk_enroll_course FOREIGN KEY (course_id)
REFERENCES courses(course_id) ON DELETE RESTRICT,
-- Table-level check constraint
CONSTRAINT chk_grade CHECK (grade IN ('A', 'B', 'C', 'D', 'F', 'I'))
);
Adding Constraints to Existing Tables:
-- Primary key
ALTER TABLE employees ADD CONSTRAINT pk_emp PRIMARY KEY (emp_id);
-- Foreign key
ALTER TABLE employees ADD CONSTRAINT fk_emp_dept
FOREIGN KEY (dept_id) REFERENCES departments(dept_id);
-- Unique constraint
ALTER TABLE employees ADD CONSTRAINT uk_emp_ssn UNIQUE (ssn);
-- Check constraint
ALTER TABLE employees ADD CONSTRAINT chk_salary CHECK (salary >= 0);
-- Not null (special case)
ALTER TABLE employees MODIFY first_name VARCHAR(50) NOT NULL;
Dropping Constraints:
ALTER TABLE employees DROP PRIMARY KEY;
ALTER TABLE employees DROP FOREIGN KEY fk_emp_dept;
ALTER TABLE employees DROP CONSTRAINT chk_salary;
ALTER TABLE employees MODIFY first_name VARCHAR(50) NULL; -- Drop not null
Subqueries are queries nested within other queries, providing powerful data retrieval capabilities.
Scalar Subqueries (Return Single Value):
-- Subquery in SELECT clause
SELECT
student_name,
(SELECT AVG(grade_points)
FROM enrollments e
WHERE e.student_id = s.student_id) AS avg_grade
FROM students s;
-- Subquery in WHERE clause
SELECT student_name, major_code
FROM students
WHERE credits_completed > (SELECT AVG(credits_completed) FROM students);
-- Subquery in HAVING clause
SELECT major_code, AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
HAVING AVG(credits_completed) > (SELECT AVG(credits_completed) FROM students);
Row Subqueries (Return Single Row):
-- Compare with multiple columns
SELECT student_name, major_code, credits_completed
FROM students
WHERE (major_code, credits_completed) =
(SELECT major_code, MAX(credits_completed)
FROM students
WHERE major_code = 'CS');
Table Subqueries (Return Multiple Rows):
-- Subquery in FROM clause (derived table)
SELECT dept_name, avg_credits
FROM departments d
JOIN (
SELECT major_code, AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
) dept_stats ON d.dept_code = dept_stats.major_code;
-- Subquery with IN
SELECT student_name
FROM students
WHERE student_id IN (
SELECT DISTINCT student_id
FROM enrollments
WHERE grade = 'A'
);
-- Subquery with EXISTS
SELECT d.dept_name
FROM departments d
WHERE EXISTS (
SELECT 1
FROM courses c
WHERE c.dept_code = d.dept_code
AND c.credits > 3
);
Correlated Subqueries:
Correlated subqueries reference columns from the outer query, executing once for each outer row.
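This per-row evaluation runs the same way on any engine with subquery support. A self-contained sketch using Python's sqlite3 (the students data below is invented):

```python
import sqlite3

# Invented toy data: two majors, two students each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_name TEXT, major_code TEXT, credits_completed INTEGER);
INSERT INTO students VALUES
  ('Alice', 'CS', 90), ('Bob', 'CS', 30),
  ('Carol', 'MATH', 80), ('Dave', 'MATH', 40);
""")

# The inner query is re-evaluated per outer row, using that row's s1.major_code.
rows = conn.execute("""
    SELECT s1.student_name
    FROM students s1
    WHERE s1.credits_completed > (
        SELECT AVG(s2.credits_completed)
        FROM students s2
        WHERE s2.major_code = s1.major_code)
    ORDER BY s1.student_name
""").fetchall()
print(rows)  # [('Alice',), ('Carol',)]
```

Each major's average is 60 here, so only the above-average student in each major survives the filter.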
-- Find students with above-average credits in their major
SELECT s1.student_name, s1.major_code, s1.credits_completed
FROM students s1
WHERE s1.credits_completed > (
SELECT AVG(s2.credits_completed)
FROM students s2
WHERE s2.major_code = s1.major_code
);
-- Find courses with above-average enrollment
SELECT c.course_id, c.title,
(SELECT COUNT(*)
FROM enrollments e
WHERE e.course_id = c.course_id) AS enrollment_count
FROM courses c
WHERE (
SELECT COUNT(*)
FROM enrollments e
WHERE e.course_id = c.course_id
) > (
SELECT AVG(enrollment_count)
FROM (
SELECT COUNT(*) AS enrollment_count
FROM enrollments
GROUP BY course_id
) avg_enrollments
);
Subquery Operators:
-- ALL: compare with all values
SELECT student_name, credits_completed
FROM students
WHERE credits_completed > ALL (
SELECT credits_completed
FROM students
WHERE major_code = 'MATH'
);
-- ANY/SOME: compare with any value
SELECT student_name, credits_completed
FROM students
WHERE credits_completed > ANY (
SELECT credits_completed
FROM students
WHERE major_code = 'MATH'
);
-- EXISTS: check for existence
SELECT d.dept_name
FROM departments d
WHERE EXISTS (
SELECT 1
FROM faculty f
WHERE f.dept_code = d.dept_code
AND f.rank = 'Professor'
AND f.salary > 150000
);
Joins combine rows from multiple tables based on related columns.
INNER JOIN:
Returns only rows with matches in both tables.
-- Basic inner join
SELECT s.student_name, c.title, e.grade
FROM students s
INNER JOIN enrollments e ON s.student_id = e.student_id
INNER JOIN courses c ON e.course_id = c.course_id;
-- Using USING when column names match
SELECT student_name, title, grade
FROM students
JOIN enrollments USING (student_id)
JOIN courses USING (course_id);
-- Multiple join conditions
SELECT e1.student_id, e1.course_id, e1.grade, e2.grade AS prev_grade
FROM enrollments e1
JOIN enrollments e2 ON e1.student_id = e2.student_id
AND e1.course_id = e2.course_id
AND e1.semester > e2.semester;
OUTER JOIN:
Preserves rows even when no match exists.
-- LEFT OUTER JOIN: all rows from left table
SELECT s.student_name, e.course_id, e.grade
FROM students s
LEFT JOIN enrollments e ON s.student_id = e.student_id;
-- RIGHT OUTER JOIN: all rows from right table
SELECT s.student_name, e.course_id, e.grade
FROM enrollments e
RIGHT JOIN students s ON e.student_id = s.student_id;
-- FULL OUTER JOIN: all rows from both tables
SELECT s.student_name, e.course_id, e.grade
FROM students s
FULL OUTER JOIN enrollments e ON s.student_id = e.student_id;
-- Note: MySQL lacks FULL OUTER JOIN; emulate with LEFT JOIN UNION RIGHT JOIN
SELF JOIN:
Joining a table with itself.
-- Find employees and their managers
SELECT e1.emp_name AS employee,
e2.emp_name AS manager
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.emp_id;
-- Find pairs of students in same major
SELECT s1.student_name AS student1,
s2.student_name AS student2,
s1.major_code
FROM students s1
JOIN students s2 ON s1.major_code = s2.major_code
AND s1.student_id < s2.student_id;
CROSS JOIN:
Cartesian product of all rows.
-- All possible combinations
SELECT s.student_name, c.title
FROM students s
CROSS JOIN courses c;
-- Generate date range
SELECT dates.date, COUNT(e.enrollment_id)
FROM (
SELECT DATE_ADD('2024-01-01', INTERVAL n DAY) AS date
FROM numbers -- Table with numbers 0 to 365
WHERE DATE_ADD('2024-01-01', INTERVAL n DAY) <= '2024-12-31'
) dates
LEFT JOIN enrollments e ON DATE(e.enrollment_date) = dates.date
GROUP BY dates.date;
NATURAL JOIN:
Automatically joins on columns with same names (use cautiously).
-- Automatically joins on common column names
SELECT student_name, title, grade
FROM students
NATURAL JOIN enrollments
NATURAL JOIN courses;
Set operations combine results from multiple queries.
UNION and UNION ALL:
-- UNION removes duplicates
SELECT student_id, student_name FROM students
WHERE major_code = 'CS'
UNION
SELECT student_id, student_name FROM students
WHERE credits_completed > 100;
-- UNION ALL preserves duplicates (faster)
SELECT student_id FROM enrollments WHERE semester = '202401'
UNION ALL
SELECT student_id FROM enrollments WHERE semester = '202402';
-- Combining different tables
SELECT 'Student' AS type, student_name AS name FROM students
UNION
SELECT 'Faculty' AS type, faculty_name AS name FROM faculty
UNION
SELECT 'Staff' AS type, staff_name AS name FROM staff;
INTERSECT:
Returns rows common to both queries.
-- Students who took both CS101 and MATH201
SELECT student_id FROM enrollments WHERE course_id = 'CS101'
INTERSECT
SELECT student_id FROM enrollments WHERE course_id = 'MATH201';
-- Equivalent using EXISTS
SELECT DISTINCT student_id
FROM enrollments e1
WHERE course_id = 'CS101'
AND EXISTS (
SELECT 1 FROM enrollments e2
WHERE e2.student_id = e1.student_id
AND e2.course_id = 'MATH201'
);
EXCEPT (or MINUS):
Returns rows from first query not in second.
-- Students who took CS101 but not MATH201
SELECT student_id FROM enrollments WHERE course_id = 'CS101'
EXCEPT
SELECT student_id FROM enrollments WHERE course_id = 'MATH201';
-- Courses with no enrollments
SELECT course_id FROM courses
EXCEPT
SELECT DISTINCT course_id FROM enrollments;
Window functions perform calculations across sets of rows related to the current row.
ROW_NUMBER, RANK, DENSE_RANK:
-- Assign row numbers
SELECT
student_name,
credits_completed,
ROW_NUMBER() OVER (ORDER BY credits_completed DESC) AS row_num,
RANK() OVER (ORDER BY credits_completed DESC) AS `rank`,
DENSE_RANK() OVER (ORDER BY credits_completed DESC) AS `dense_rank`
FROM students;
-- Partition by major
SELECT
student_name,
major_code,
credits_completed,
ROW_NUMBER() OVER (PARTITION BY major_code
ORDER BY credits_completed DESC) AS rank_in_major
FROM students;
LAG and LEAD:
Access previous/next row values.
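SQLite (3.25+) implements LAG and LEAD as well, so the behavior is easy to check locally; a minimal sketch with an invented student_gpa table:

```python
import sqlite3  # window functions require SQLite >= 3.25

# Invented toy data: one student's GPA across three semesters.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_gpa (student_id INTEGER, semester TEXT, grade_points REAL);
INSERT INTO student_gpa VALUES
  (1, '202301', 3.0), (1, '202302', 3.5), (1, '202401', 3.2);
""")

# LAG pulls the previous row's value within the ordered window;
# the first row has no predecessor, so it gets NULL.
rows = conn.execute("""
    SELECT semester, grade_points,
           LAG(grade_points, 1) OVER (ORDER BY semester) AS prev_gpa
    FROM student_gpa WHERE student_id = 1
""").fetchall()
print(rows)  # [('202301', 3.0, None), ('202302', 3.5, 3.0), ('202401', 3.2, 3.5)]
```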
-- Compare with previous enrollment
SELECT
student_id,
semester,
grade_points,
LAG(grade_points, 1) OVER (PARTITION BY student_id
ORDER BY semester) AS prev_gpa,
grade_points - LAG(grade_points, 1, grade_points)
OVER (PARTITION BY student_id ORDER BY semester) AS gpa_change
FROM student_gpa;
-- LEAD for next value
SELECT
employee_id,
salary,
effective_date,
LEAD(effective_date, 1, '9999-12-31')
OVER (PARTITION BY employee_id ORDER BY effective_date) AS next_date
FROM salary_history;
Aggregate Window Functions:
-- Running totals
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) AS running_total,
AVG(amount) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg
FROM orders
WHERE customer_id = 123;
-- Percentage of total
SELECT
department,
employee_name,
salary,
salary / SUM(salary) OVER (PARTITION BY department) * 100 AS pct_of_dept
FROM employees;
FIRST_VALUE and LAST_VALUE:
-- First and last in group
SELECT
department,
employee_name,
salary,
FIRST_VALUE(employee_name) OVER (PARTITION BY department
ORDER BY salary DESC) AS highest_paid,
LAST_VALUE(employee_name) OVER (PARTITION BY department
ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS lowest_paid
FROM employees;
NTILE:
Divide rows into buckets.
-- Divide students into quartiles by credits
SELECT
student_name,
credits_completed,
NTILE(4) OVER (ORDER BY credits_completed) AS quartile
FROM students;
-- Deciles by salary
SELECT
department,
employee_name,
salary,
NTILE(10) OVER (PARTITION BY department ORDER BY salary) AS salary_decile
FROM employees;
CTEs provide temporary result sets within a query.
Basic CTE:
WITH dept_stats AS (
SELECT
major_code,
COUNT(*) AS student_count,
AVG(credits_completed) AS avg_credits
FROM students
GROUP BY major_code
)
SELECT d.dept_name, ds.student_count, ds.avg_credits
FROM departments d
JOIN dept_stats ds ON d.dept_code = ds.major_code;
Multiple CTEs:
WITH
high_achievers AS (
SELECT student_id, AVG(grade_points) AS gpa
FROM enrollments
GROUP BY student_id
HAVING AVG(grade_points) > 3.5
),
active_students AS (
SELECT student_id, student_name
FROM students
WHERE status = 'active'
)
SELECT a.student_name, h.gpa
FROM active_students a
JOIN high_achievers h ON a.student_id = h.student_id;
CTE for Hierarchical Data:
See Recursive Queries section (5.6).
Recursive CTEs enable querying hierarchical or graph-structured data.
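The anchor/recursive-member structure can be exercised in miniature with Python's sqlite3, which also supports WITH RECURSIVE (the three-person employees table below is invented):

```python
import sqlite3

# Invented toy hierarchy: Dana manages Eli, who manages Fay.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER, emp_name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
  (1, 'Dana', NULL),
  (2, 'Eli',  1),
  (3, 'Fay',  2);
""")

rows = conn.execute("""
    WITH RECURSIVE emp_hierarchy AS (
        SELECT emp_id, emp_name, 1 AS level      -- anchor: top of the tree
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.emp_id, e.emp_name, h.level + 1 -- recursive step: direct reports
        FROM employees e
        JOIN emp_hierarchy h ON e.manager_id = h.emp_id)
    SELECT emp_name, level FROM emp_hierarchy ORDER BY level
""").fetchall()
print(rows)  # [('Dana', 1), ('Eli', 2), ('Fay', 3)]
```

The recursion stops naturally when a pass produces no new rows, i.e. when the deepest employees have no reports.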
Employee Hierarchy:
WITH RECURSIVE emp_hierarchy AS (
-- Anchor member: top-level employees
SELECT
emp_id,
emp_name,
manager_id,
1 AS level,
CAST(emp_name AS CHAR(1000)) AS path
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive member
SELECT
e.emp_id,
e.emp_name,
e.manager_id,
h.level + 1,
CONCAT(h.path, ' -> ', e.emp_name)
FROM employees e
JOIN emp_hierarchy h ON e.manager_id = h.emp_id
)
SELECT * FROM emp_hierarchy
ORDER BY path;
Bill of Materials:
WITH RECURSIVE bom AS (
-- Base components
SELECT
part_id,
part_name,
quantity,
part_id AS root_part,
1 AS level
FROM parts
WHERE parent_id IS NULL
UNION ALL
-- Subcomponents
SELECT
p.part_id,
p.part_name,
p.quantity * b.quantity,
b.root_part,
b.level + 1
FROM parts p
JOIN bom b ON p.parent_id = b.part_id
)
SELECT
root_part,
part_id,
part_name,
quantity,
level
FROM bom
ORDER BY root_part, level;
Graph Traversal:
-- Find all paths in a graph
WITH RECURSIVE paths AS (
-- Direct connections
SELECT
node_from,
node_to,
CAST(node_from AS CHAR(1000)) AS path,
1 AS length
FROM edges
UNION ALL
-- Extended paths
SELECT
p.node_from,
e.node_to,
CONCAT(p.path, ',', e.node_to),
p.length + 1
FROM paths p
JOIN edges e ON p.node_to = e.node_from
WHERE FIND_IN_SET(e.node_to, p.path) = 0 -- skip nodes already on the path (cycle check)
)
SELECT * FROM paths;
Stored procedures encapsulate business logic within the database.
Basic Procedure:
DELIMITER //
CREATE PROCEDURE GetStudentInfo(IN student_id_param INT)
BEGIN
SELECT s.student_name, s.major_code, s.credits_completed,
COUNT(e.course_id) AS courses_taken,
AVG(e.grade_points) AS gpa
FROM students s
LEFT JOIN enrollments e ON s.student_id = e.student_id
WHERE s.student_id = student_id_param
GROUP BY s.student_id;
END//
DELIMITER ;
-- Call procedure
CALL GetStudentInfo(1001);
Procedure with OUT Parameters:
DELIMITER //
CREATE PROCEDURE CalculateDepartmentStats(
IN dept_code_param CHAR(4),
OUT student_count INT,
OUT avg_credits DECIMAL(5,2),
OUT max_credits INT
)
BEGIN
SELECT COUNT(*), AVG(credits_completed), MAX(credits_completed)
INTO student_count, avg_credits, max_credits
FROM students
WHERE major_code = dept_code_param;
END//
DELIMITER ;
-- Call with OUT parameters
CALL CalculateDepartmentStats('CS', @count, @avg, @max);
SELECT @count, @avg, @max;
Procedure with Conditional Logic:
DELIMITER //
CREATE PROCEDURE EnrollStudent(
IN p_student_id INT,
IN p_course_id CHAR(8),
IN p_semester CHAR(6)
)
BEGIN
DECLARE v_enrollment_count INT;
DECLARE v_course_capacity INT;
DECLARE v_current_enrollments INT;
-- Start transaction
START TRANSACTION;
-- Check if already enrolled
SELECT COUNT(*) INTO v_enrollment_count
FROM enrollments
WHERE student_id = p_student_id
AND course_id = p_course_id
AND semester = p_semester;
IF v_enrollment_count > 0 THEN
ROLLBACK;
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Student already enrolled in this course';
END IF;
-- Check course capacity
SELECT capacity INTO v_course_capacity
FROM course_sections
WHERE course_id = p_course_id AND semester = p_semester;
SELECT COUNT(*) INTO v_current_enrollments
FROM enrollments
WHERE course_id = p_course_id AND semester = p_semester;
IF v_current_enrollments >= v_course_capacity THEN
ROLLBACK;
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Course section is full';
END IF;
-- Enroll student
INSERT INTO enrollments (student_id, course_id, semester, enrollment_date)
VALUES (p_student_id, p_course_id, p_semester, CURRENT_DATE);
COMMIT;
END//
DELIMITER ;
Triggers automatically execute in response to database events.
BEFORE INSERT Trigger:
DELIMITER //
CREATE TRIGGER validate_enrollment
BEFORE INSERT ON enrollments
FOR EACH ROW
BEGIN
DECLARE v_student_status VARCHAR(20);
DECLARE v_prereq_complete BOOLEAN;
-- Check student status
SELECT status INTO v_student_status
FROM students
WHERE student_id = NEW.student_id;
IF v_student_status != 'active' THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Student is not active';
END IF;
-- Check prerequisites (simplified)
-- Additional validation logic here
END//
DELIMITER ;
AFTER UPDATE Trigger:
DELIMITER //
CREATE TRIGGER update_student_gpa
AFTER UPDATE ON enrollments
FOR EACH ROW
BEGIN
DECLARE v_new_gpa DECIMAL(3,2);
-- Recalculate student's GPA
SELECT AVG(grade_points) INTO v_new_gpa
FROM enrollments
WHERE student_id = NEW.student_id
AND grade IS NOT NULL;
-- Update student record
UPDATE students
SET gpa = v_new_gpa
WHERE student_id = NEW.student_id;
END//
DELIMITER ;
Triggers for Auditing:
DELIMITER //
CREATE TRIGGER audit_student_changes
AFTER UPDATE ON students
FOR EACH ROW
BEGIN
INSERT INTO audit_log (
table_name,
record_id,
action_type,
old_value,
new_value,
changed_by,
change_date
) VALUES (
'students',
OLD.student_id,
'UPDATE',
JSON_OBJECT(
'name', OLD.student_name,
'major', OLD.major_code,
'status', OLD.status
),
JSON_OBJECT(
'name', NEW.student_name,
'major', NEW.major_code,
'status', NEW.status
),
USER(),
NOW()
);
END//
DELIMITER ;
Indexes dramatically improve query performance.
Basic Index Creation:
-- Single column index
CREATE INDEX idx_student_name ON students(student_name);
-- Unique index
CREATE UNIQUE INDEX idx_student_email ON students(email);
-- Composite index
CREATE INDEX idx_enrollments_lookup ON enrollments(student_id, course_id, semester);
-- Prefix index (for strings)
CREATE INDEX idx_course_title ON courses(title(20));
Functional Indexes:
-- Index on expression (PostgreSQL)
CREATE INDEX idx_student_lower_email ON students(LOWER(email));
-- Index on function (Oracle)
CREATE INDEX idx_enrollment_year ON enrollments(EXTRACT(YEAR FROM enrollment_date));
Partial Indexes:
-- Index only active students (PostgreSQL)
CREATE INDEX idx_active_students ON students(student_id)
WHERE status = 'active';
-- Index high-value orders
CREATE INDEX idx_large_orders ON orders(order_id)
WHERE amount > 1000;
Invisible/Dropping Indexes:
-- Make index invisible (Oracle)
ALTER INDEX idx_student_name INVISIBLE;
-- Drop index
DROP INDEX idx_student_name ON students;
Understanding and optimizing query performance.
EXPLAIN Plans:
-- MySQL
EXPLAIN SELECT s.student_name, c.title, e.grade
FROM students s
JOIN enrollments e ON s.student_id = e.student_id
JOIN courses c ON e.course_id = c.course_id
WHERE s.major_code = 'CS';
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM students WHERE student_name LIKE 'Smith%';
-- Oracle
EXPLAIN PLAN FOR
SELECT * FROM students WHERE credits_completed > 100;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
Query Optimization Techniques:
-- EXISTS vs IN for large subqueries (optimizer-dependent)
-- Often slower on older optimizers
SELECT * FROM students
WHERE student_id IN (SELECT student_id FROM enrollments);
-- Often faster; many modern optimizers rewrite both forms identically
SELECT * FROM students s
WHERE EXISTS (SELECT 1 FROM enrollments e WHERE e.student_id = s.student_id);
-- Avoid functions on indexed columns
-- Bad (function on indexed column)
SELECT * FROM students WHERE YEAR(enrollment_date) = 2024;
-- Good
SELECT * FROM students
WHERE enrollment_date BETWEEN '2024-01-01' AND '2024-12-31';
-- Use appropriate join order
-- Place most selective conditions first
SELECT s.student_name, c.title
FROM students s
JOIN enrollments e ON s.student_id = e.student_id
JOIN courses c ON e.course_id = c.course_id
WHERE s.major_code = 'CS' -- Selective condition
AND e.semester = '202401';
Index Usage Analysis:
-- Find unused indexes (PostgreSQL)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;
-- Index recommendations (SQL Server)
SELECT * FROM sys.dm_db_missing_index_details;
PL/SQL (Procedural Language/SQL) is Oracle's procedural extension to SQL.
Block Structure:
DECLARE
-- Declaration section
v_student_name students.student_name%TYPE;
v_enrollment_count INTEGER := 0;
v_current_date DATE := SYSDATE;
CURSOR c_students IS
SELECT student_id, student_name FROM students WHERE status = 'active';
BEGIN
-- Executable section
SELECT student_name INTO v_student_name
FROM students
WHERE student_id = 1001;
DBMS_OUTPUT.PUT_LINE('Student: ' || v_student_name);
FOR student_rec IN c_students LOOP
DBMS_OUTPUT.PUT_LINE('Processing: ' || student_rec.student_name);
END LOOP;
EXCEPTION
-- Exception handling section
WHEN NO_DATA_FOUND THEN
DBMS_OUTPUT.PUT_LINE('Student not found');
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('Error: ' || SQLERRM);
END;
/
Stored Procedures in PL/SQL:
CREATE OR REPLACE PROCEDURE calculate_gpa (
p_student_id IN students.student_id%TYPE,
p_gpa OUT NUMBER
) IS
v_total_points NUMBER := 0;
v_total_credits NUMBER := 0;
BEGIN
SELECT SUM(c.credits *
CASE e.grade
WHEN 'A' THEN 4.0
WHEN 'B' THEN 3.0
WHEN 'C' THEN 2.0
WHEN 'D' THEN 1.0
ELSE 0
END),
SUM(c.credits)
INTO v_total_points, v_total_credits
FROM enrollments e
JOIN courses c ON e.course_id = c.course_id
WHERE e.student_id = p_student_id
AND e.grade IS NOT NULL;
IF v_total_credits > 0 THEN
p_gpa := v_total_points / v_total_credits;
ELSE
p_gpa := 0;
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
p_gpa := 0;
WHEN OTHERS THEN
RAISE_APPLICATION_ERROR(-20001, 'Error calculating GPA: ' || SQLERRM);
END calculate_gpa;
/
Functions in PL/SQL:
CREATE OR REPLACE FUNCTION get_student_status (
p_student_id IN students.student_id%TYPE
) RETURN VARCHAR2 IS
v_status students.status%TYPE;
v_credits students.credits_completed%TYPE;
BEGIN
SELECT status, credits_completed
INTO v_status, v_credits
FROM students
WHERE student_id = p_student_id;
IF v_status = 'graduated' THEN
RETURN 'Graduated';
ELSIF v_credits >= 120 THEN
RETURN 'Eligible for Graduation';
ELSIF v_credits >= 90 THEN
RETURN 'Senior';
ELSIF v_credits >= 60 THEN
RETURN 'Junior';
ELSIF v_credits >= 30 THEN
RETURN 'Sophomore';
ELSE
RETURN 'Freshman';
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
RETURN 'Not Found';
END get_student_status;
/
T-SQL (Transact-SQL) is Microsoft SQL Server's procedural extension.
Variables and Control Flow:
DECLARE @StudentID INT = 1001
DECLARE @StudentName NVARCHAR(100)
DECLARE @GPA DECIMAL(3,2)
-- IF-ELSE
IF EXISTS (SELECT 1 FROM students WHERE student_id = @StudentID)
BEGIN
SELECT @StudentName = student_name
FROM students
WHERE student_id = @StudentID
PRINT 'Found student: ' + @StudentName
-- WHILE loop
WHILE @StudentID < 1010
BEGIN
-- Process students
SET @StudentID = @StudentID + 1
END
END
ELSE
BEGIN
PRINT 'Student not found'
END
-- CASE expression
SELECT
student_name,
CASE
WHEN credits_completed >= 90 THEN 'Senior'
WHEN credits_completed >= 60 THEN 'Junior'
WHEN credits_completed >= 30 THEN 'Sophomore'
ELSE 'Freshman'
END AS class_level
FROM students
Stored Procedures in T-SQL:
CREATE PROCEDURE sp_EnrollStudent
@StudentID INT,
@CourseID CHAR(8),
@Semester CHAR(6)
AS
BEGIN
SET NOCOUNT ON;
BEGIN TRY
BEGIN TRANSACTION
-- Check if already enrolled
IF EXISTS (
SELECT 1 FROM enrollments
WHERE student_id = @StudentID
AND course_id = @CourseID
AND semester = @Semester
)
BEGIN
RAISERROR('Student already enrolled', 16, 1)
ROLLBACK TRANSACTION
RETURN
END
-- Check prerequisites (simplified)
-- Insert enrollment
INSERT INTO enrollments (student_id, course_id, semester, enrollment_date)
VALUES (@StudentID, @CourseID, @Semester, GETDATE())
COMMIT TRANSACTION
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION
DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE()
DECLARE @ErrorSeverity INT = ERROR_SEVERITY()
DECLARE @ErrorState INT = ERROR_STATE()
RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState)
END CATCH
END
GO
User-Defined Functions:
-- Scalar function (returns single value)
CREATE FUNCTION dbo.CalculateAge
(
@BirthDate DATE
)
RETURNS INT
AS
BEGIN
DECLARE @Age INT
SET @Age = DATEDIFF(YEAR, @BirthDate, GETDATE())
-- Adjust for birthday not yet occurred this year
IF DATEADD(YEAR, @Age, @BirthDate) > GETDATE()
SET @Age = @Age - 1
RETURN @Age
END
GO
-- Table-valued function
CREATE FUNCTION dbo.GetDepartmentStudents
(
@DeptCode CHAR(4)
)
RETURNS TABLE
AS
RETURN
(
SELECT student_id, student_name, credits_completed
FROM students
WHERE major_code = @DeptCode
AND status = 'active'
)
GO
-- Usage
SELECT * FROM dbo.GetDepartmentStudents('CS')
WHERE credits_completed > 60
Oracle Packages:
Packages group related procedures and functions.
CREATE OR REPLACE PACKAGE student_pkg IS
-- Public declarations
FUNCTION calculate_gpa(p_student_id IN NUMBER) RETURN NUMBER;
PROCEDURE enroll_student(
p_student_id IN NUMBER,
p_course_id IN VARCHAR2,
p_semester IN VARCHAR2
);
-- Public constant
c_min_credits CONSTANT NUMBER := 120;
-- Public exception
e_student_inactive EXCEPTION;
END student_pkg;
/
CREATE OR REPLACE PACKAGE BODY student_pkg IS
-- Private variable
v_debug_mode BOOLEAN := FALSE;
-- Private function
FUNCTION validate_student(p_student_id IN NUMBER) RETURN BOOLEAN IS
v_status students.status%TYPE;
BEGIN
SELECT status INTO v_status
FROM students
WHERE student_id = p_student_id;
RETURN v_status = 'active';
EXCEPTION
WHEN NO_DATA_FOUND THEN
RETURN FALSE;
END validate_student;
-- Public function implementation
FUNCTION calculate_gpa(p_student_id IN NUMBER) RETURN NUMBER IS
v_gpa NUMBER;
BEGIN
SELECT AVG(
CASE grade
WHEN 'A' THEN 4.0
WHEN 'B' THEN 3.0
WHEN 'C' THEN 2.0
WHEN 'D' THEN 1.0
ELSE 0
END
) INTO v_gpa
FROM enrollments
WHERE student_id = p_student_id
AND grade IS NOT NULL;
RETURN NVL(v_gpa, 0);
END calculate_gpa;
-- Public procedure implementation
PROCEDURE enroll_student(
p_student_id IN NUMBER,
p_course_id IN VARCHAR2,
p_semester IN VARCHAR2
) IS
BEGIN
IF NOT validate_student(p_student_id) THEN
RAISE e_student_inactive;
END IF;
INSERT INTO enrollments (student_id, course_id, semester, enrollment_date)
VALUES (p_student_id, p_course_id, p_semester, SYSDATE);
COMMIT;
EXCEPTION
WHEN DUP_VAL_ON_INDEX THEN
RAISE_APPLICATION_ERROR(-20002, 'Already enrolled');
WHEN e_student_inactive THEN
RAISE_APPLICATION_ERROR(-20003, 'Student not active');
END enroll_student;
END student_pkg;
/
SQL Server Error Handling:
CREATE PROCEDURE sp_SafeTransfer
@FromAccount INT,
@ToAccount INT,
@Amount DECIMAL(10,2)
AS
BEGIN
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE @TranName VARCHAR(20) = 'TransferTran';
BEGIN TRY
BEGIN TRANSACTION @TranName;
-- Check sufficient funds
IF NOT EXISTS (
SELECT 1 FROM accounts
WHERE account_id = @FromAccount
AND balance >= @Amount
)
BEGIN
;THROW 50001, 'Insufficient funds', 1;
END
UPDATE accounts SET balance = balance - @Amount
WHERE account_id = @FromAccount;
UPDATE accounts SET balance = balance + @Amount
WHERE account_id = @ToAccount;
COMMIT TRANSACTION @TranName;
END TRY
BEGIN CATCH
IF @@TRANCOUNT > 0
ROLLBACK TRANSACTION @TranName;
DECLARE @ErrorMessage NVARCHAR(4000) = ERROR_MESSAGE();
DECLARE @ErrorSeverity INT = ERROR_SEVERITY();
DECLARE @ErrorState INT = ERROR_STATE();
-- Log error
INSERT INTO error_log (error_message, error_date, procedure_name)
VALUES (@ErrorMessage, GETDATE(), 'sp_SafeTransfer');
-- Re-raise
RAISERROR(@ErrorMessage, @ErrorSeverity, @ErrorState);
END CATCH
END
Oracle Exception Handling:
CREATE OR REPLACE PROCEDURE process_enrollments IS
-- User-defined exception
e_invalid_grade EXCEPTION;
PRAGMA EXCEPTION_INIT(e_invalid_grade, -20001);
-- Cursor with error handling
CURSOR c_enrollments IS
SELECT student_id, course_id, grade
FROM pending_enrollments;
BEGIN
FOR rec IN c_enrollments LOOP
BEGIN
-- Validate grade
IF rec.grade NOT IN ('A','B','C','D','F') THEN
RAISE e_invalid_grade;
END IF;
-- Insert valid enrollment
INSERT INTO enrollments (student_id, course_id, grade)
VALUES (rec.student_id, rec.course_id, rec.grade);
DELETE FROM pending_enrollments
WHERE student_id = rec.student_id
AND course_id = rec.course_id;
EXCEPTION
WHEN e_invalid_grade THEN
-- Log invalid grade and continue
INSERT INTO error_log (message, occurred_at)
VALUES ('Invalid grade for student ' || rec.student_id, SYSDATE);
WHEN DUP_VAL_ON_INDEX THEN
-- Handle duplicate
UPDATE pending_enrollments
SET error_message = 'Duplicate enrollment'
WHERE student_id = rec.student_id
AND course_id = rec.course_id;
WHEN OTHERS THEN
-- Unexpected error, log and continue
INSERT INTO error_log (message, occurred_at)
VALUES (SQLERRM, SYSDATE);
END;
END LOOP;
COMMIT;
EXCEPTION
WHEN OTHERS THEN
ROLLBACK;
RAISE_APPLICATION_ERROR(-20002, 'Fatal error: ' || SQLERRM);
END process_enrollments;
/
Explicit Cursors (Oracle):
DECLARE
CURSOR c_student_courses (p_student_id NUMBER) IS
SELECT c.title, e.grade, c.credits
FROM enrollments e
JOIN courses c ON e.course_id = c.course_id
WHERE e.student_id = p_student_id
ORDER BY e.semester DESC;
v_student_name students.student_name%TYPE;
v_course_record c_student_courses%ROWTYPE;
BEGIN
SELECT student_name INTO v_student_name
FROM students
WHERE student_id = 1001;
DBMS_OUTPUT.PUT_LINE('Courses for ' || v_student_name);
DBMS_OUTPUT.PUT_LINE('------------------------');
OPEN c_student_courses(1001);
LOOP
FETCH c_student_courses INTO v_course_record;
EXIT WHEN c_student_courses%NOTFOUND;
DBMS_OUTPUT.PUT_LINE(
v_course_record.title || ': ' ||
NVL(v_course_record.grade, 'IP') ||
' (' || v_course_record.credits || ' credits)'
);
END LOOP;
DBMS_OUTPUT.PUT_LINE('Total courses: ' || c_student_courses%ROWCOUNT);
CLOSE c_student_courses;
END;
/
Cursor FOR Loops:
DECLARE
CURSOR c_dept_summary IS
SELECT d.dept_name,
COUNT(DISTINCT s.student_id) AS student_count,
COUNT(DISTINCT f.faculty_id) AS faculty_count
FROM departments d
LEFT JOIN students s ON d.dept_code = s.major_code
LEFT JOIN faculty f ON d.dept_code = f.dept_code
GROUP BY d.dept_name;
BEGIN
FOR dept_rec IN c_dept_summary LOOP
DBMS_OUTPUT.PUT_LINE(
dept_rec.dept_name || ': ' ||
dept_rec.student_count || ' students, ' ||
dept_rec.faculty_count || ' faculty'
);
END LOOP;
END;
/
Cursors with Parameters (SQL Server):
DECLARE @StudentID INT = 1001
DECLARE @CourseID CHAR(8)
DECLARE @Grade CHAR(2)
DECLARE student_courses CURSOR FOR
SELECT c.course_id, e.grade
FROM enrollments e
JOIN courses c ON e.course_id = c.course_id
WHERE e.student_id = @StudentID
OPEN student_courses
FETCH NEXT FROM student_courses INTO @CourseID, @Grade
WHILE @@FETCH_STATUS = 0
BEGIN
PRINT 'Course: ' + @CourseID + ', Grade: ' + ISNULL(@Grade, 'IP')
FETCH NEXT FROM student_courses INTO @CourseID, @Grade
END
CLOSE student_courses
DEALLOCATE student_courses
Dynamic SQL constructs and executes SQL statements at runtime.
Oracle EXECUTE IMMEDIATE:
CREATE OR REPLACE PROCEDURE search_table(
p_table_name VARCHAR2,
p_column_name VARCHAR2,
p_search_value VARCHAR2
) IS
v_sql VARCHAR2(1000);
v_count NUMBER;
v_cursor SYS_REFCURSOR;
BEGIN
-- Build dynamic SQL (identifiers are concatenated, not bound;
-- validate p_table_name and p_column_name to prevent SQL injection)
v_sql := 'SELECT COUNT(*) FROM ' || p_table_name ||
' WHERE ' || p_column_name || ' = :value';
-- Execute with USING clause
EXECUTE IMMEDIATE v_sql INTO v_count USING p_search_value;
DBMS_OUTPUT.PUT_LINE('Found ' || v_count || ' rows');
-- Dynamic cursor: EXECUTE IMMEDIATE alone cannot return a multi-row
-- result set, so open a ref cursor for the dynamic statement
v_sql := 'SELECT * FROM ' || p_table_name ||
' WHERE ' || p_column_name || ' = :value';
OPEN v_cursor FOR v_sql USING p_search_value;
CLOSE v_cursor;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('Error: ' || SQLERRM);
END search_table;
/
SQL Server sp_executesql:
CREATE PROCEDURE sp_DynamicPivot
@SchemaName NVARCHAR(128),
@TableName NVARCHAR(128),
@PivotColumn NVARCHAR(128),
@AggregateColumn NVARCHAR(128)
AS
BEGIN
DECLARE @Columns NVARCHAR(MAX)
DECLARE @SQL NVARCHAR(MAX)
-- Build the column list from the data values; the column name is itself
-- a parameter, so collecting the distinct values requires dynamic SQL
SET @SQL = N'SELECT @cols = STUFF(
(SELECT DISTINCT '','' + QUOTENAME(' + QUOTENAME(@PivotColumn) + N')
FROM ' + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName) + N'
FOR XML PATH('''')), 1, 1, '''')'
EXEC sp_executesql @SQL, N'@cols NVARCHAR(MAX) OUTPUT', @cols = @Columns OUTPUT
-- Build dynamic pivot query
SET @SQL = '
SELECT *
FROM (
SELECT ' + @PivotColumn + ', ' + @AggregateColumn + '
FROM ' + QUOTENAME(@SchemaName) + '.' + QUOTENAME(@TableName) + '
) AS SourceTable
PIVOT (
COUNT(' + @AggregateColumn + ')
FOR ' + @PivotColumn + ' IN (' + @Columns + ')
) AS PivotTable'
EXEC sp_executesql @SQL
END
Dynamic SQL for Table Maintenance:
CREATE OR REPLACE PROCEDURE archive_old_records(
p_table_name VARCHAR2,
p_date_column VARCHAR2,
p_cutoff_date DATE,
p_archive_table VARCHAR2 DEFAULT NULL
) IS
v_archive_name VARCHAR2(100);
v_create_sql VARCHAR2(1000);
v_insert_sql VARCHAR2(1000);
v_delete_sql VARCHAR2(1000);
v_count NUMBER;
BEGIN
-- Set archive table name
v_archive_name := NVL(p_archive_table, p_table_name || '_ARCHIVE');
-- Create archive table if not exists
BEGIN
v_create_sql := 'CREATE TABLE ' || v_archive_name ||
' AS SELECT * FROM ' || p_table_name ||
' WHERE 1=0';
EXECUTE IMMEDIATE v_create_sql;
EXCEPTION
WHEN OTHERS THEN
NULL; -- Table already exists
END;
-- Archive old records
v_insert_sql := 'INSERT INTO ' || v_archive_name ||
' SELECT * FROM ' || p_table_name ||
' WHERE ' || p_date_column || ' < :cutoff';
EXECUTE IMMEDIATE v_insert_sql USING p_cutoff_date;
-- Get count of archived records
v_count := SQL%ROWCOUNT;
-- Delete archived records
v_delete_sql := 'DELETE FROM ' || p_table_name ||
' WHERE ' || p_date_column || ' < :cutoff';
EXECUTE IMMEDIATE v_delete_sql USING p_cutoff_date;
DBMS_OUTPUT.PUT_LINE('Archived ' || v_count || ' records from ' || p_table_name);
COMMIT;
EXCEPTION
WHEN OTHERS THEN
ROLLBACK;
RAISE_APPLICATION_ERROR(-20001, 'Archive failed: ' || SQLERRM);
END archive_old_records;
/
Functional dependencies (FDs) are constraints that describe relationships between attributes in a relation. They form the theoretical foundation for database normalization.
Formal Definition:
Given a relation R with attribute set U, a functional dependency X → Y (read as "X determines Y" or "X functionally determines Y") holds if for any two tuples t1 and t2 in R:
If t1[X] = t2[X] then t1[Y] = t2[Y]
In other words, whenever two tuples agree on all attributes in X, they must also agree on all attributes in Y.
Example:
In a Student relation with attributes (student_id, student_name, major, advisor), the FD: student_id → student_name, major, advisor
This means each student_id uniquely determines the student's name, major, and advisor.
Notation:
- Capital letters (X, Y, Z) represent sets of attributes
- XY denotes the union of attribute sets X and Y
- X → Y denotes functional dependency
- F denotes a set of functional dependencies
Trivial vs Non-trivial FDs:
- Trivial FD: X → Y where Y ⊆ X (always true)
- Non-trivial FD: X → Y where Y ⊈ X
- Completely non-trivial FD: X ∩ Y = ∅
Examples:
- {student_id, student_name} → student_id (trivial: the right side is a subset of the left)
- {student_id, student_name} → {student_name, major} (non-trivial, but the two sides overlap)
- student_id → student_name (completely non-trivial: the two sides share no attributes)
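The pairwise-agreement test in the formal definition can be run directly on sample data. A Python sketch (the helper name `fd_holds` and the sample rows are illustrative, not from the text):

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """Check X -> Y on a concrete relation instance.

    rows: list of dicts (tuples of the relation)
    lhs, rhs: lists of attribute names
    """
    for t1, t2 in combinations(rows, 2):
        if all(t1[a] == t2[a] for a in lhs):          # tuples agree on X...
            if not all(t1[a] == t2[a] for a in rhs):  # ...but disagree on Y
                return False
    return True

# Hypothetical Student instance
students = [
    {"student_id": 101, "student_name": "Alice", "major": "CS"},
    {"student_id": 102, "student_name": "Bob",   "major": "CS"},
]

print(fd_holds(students, ["student_id"], ["student_name", "major"]))  # True
print(fd_holds(students, ["major"], ["student_name"]))  # False: CS maps to both names
```

Note that an instance can only refute an FD; the FD itself is a constraint on all legal instances, not a property derived from one.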
Armstrong's axioms are a set of inference rules used to derive all functional dependencies implied by a given set F.
Reflexivity (Subset Rule): If Y ⊆ X, then X → Y (Any set of attributes determines all its subsets)
Augmentation: If X → Y, then XZ → YZ for any Z (Adding the same attributes to both sides preserves dependency)
Transitivity: If X → Y and Y → Z, then X → Z (Dependencies chain together)
Additional Rules Derived from Armstrong's Axioms:
Union: If X → Y and X → Z, then X → YZ (Combine dependencies with same left side)
Decomposition: If X → YZ, then X → Y and X → Z (Split right side into individual attributes)
Pseudo-transitivity: If X → Y and WY → Z, then XW → Z (Extended transitivity with additional attributes)
Example Proof:
Given F = {student_id → major, major → advisor}
Prove: student_id → advisor
Proof:
- student_id → major (given)
- major → advisor (given)
- student_id → advisor (transitivity: 1,2)
Closure of Functional Dependencies:
The closure F⁺ is the set of all FDs that can be derived from F using Armstrong's axioms.
Example: Given R = (A, B, C, G, H, I) and F = {A → B, A → C, CG → H, CG → I, B → H}, some members of the closure:
- From A → B and B → H, we get A → H (transitivity)
- From A → B and A → C, we get A → BC (union)
- From CG → H and CG → I, we get CG → HI (union)
- From A → BC and augmentation, A → ABC
The closure of an attribute set X under FD set F, denoted X⁺, is the set of all attributes that are functionally determined by X.
Algorithm for Computing Attribute Closure:
closure = X
while changes to closure do
for each FD Y → Z in F do
if Y ⊆ closure then
closure = closure ∪ Z
end if
end for
end while
return closure
Example:
Given R = (A, B, C, D, E, F) and F = {A → BC, E → CF, B → E, CD → EF}, compute the closure of {A, B}:
Step 1: closure = {A, B}
Step 2: Using A → BC: A in closure, add C → closure = {A, B, C}
Step 3: Using B → E: B in closure, add E → closure = {A, B, C, E}
Step 4: Using E → CF: E in closure, add C, F → closure = {A, B, C, E, F}
Step 5: Using CD → EF: need C and D; D not in closure → no change
Step 6: No more changes
Result: {A, B}⁺ = {A, B, C, E, F}
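The worked example can be checked with a direct implementation of the closure algorithm. A Python sketch (attribute sets are written as one-letter strings; names are illustrative):

```python
def attribute_closure(attrs, fds):
    """X+ under F: repeatedly fire FDs whose left side is inside the closure."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

# F = {A -> BC, E -> CF, B -> E, CD -> EF}, as (left, right) pairs
F = [("A", "BC"), ("E", "CF"), ("B", "E"), ("CD", "EF")]
print(sorted(attribute_closure("AB", F)))  # ['A', 'B', 'C', 'E', 'F']
```

Each pass scans F once, so the loop runs at most |F| times and the whole computation is polynomial, unlike computing F⁺ itself.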
Uses of Attribute Closure:
- Testing Superkey: X is a superkey if X⁺ contains all attributes of R
- Testing FD Validity: X → Y holds if Y ⊆ X⁺
- Computing F⁺: For each subset X of attributes, compute X⁺
- Finding Candidate Keys: Minimal attribute sets whose closure includes all attributes
Finding Candidate Keys:
Given R = (A, B, C, D) and F = {A → C, B → D, C → D}
Find candidate keys:
- Try {A, B}: {A, B}⁺ = {A, B, C, D} → candidate key
- Try {A, C}: {A, C}⁺ = {A, C, D} (B missing) → not key
- Try {B, C}: {B, C}⁺ = {B, C, D} (A missing) → not key
- Try {A, B, C}: superkey but not minimal
- Try {A, D}: {A, D}⁺ = {A, D, C} (B missing) → not key
- Try {B, C, D}: {B, C, D}⁺ = {B, C, D} (A missing) → not key
Only {A, B} is a candidate key.
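The hunt above can be automated by testing closures of subsets in increasing size and keeping only minimal ones. A brute-force Python sketch (fine for small schemas; helper names are illustrative):

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    """X+ under F, as in the closure algorithm above."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def candidate_keys(R, fds):
    """All minimal attribute sets whose closure covers R."""
    R = set(R)
    keys = []
    for size in range(1, len(R) + 1):
        for combo in combinations(sorted(R), size):
            if attribute_closure(combo, fds) == R:
                # a superkey containing an already-found key is not minimal
                if not any(set(k) <= set(combo) for k in keys):
                    keys.append(combo)
    return keys

F = [("A", "C"), ("B", "D"), ("C", "D")]
print(candidate_keys("ABCD", F))  # [('A', 'B')]
```

Enumerating subsets is exponential in the worst case, which matches the theory: a relation can have exponentially many candidate keys.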
A canonical cover (or minimal cover) Fc is a minimal set of functional dependencies equivalent to F, with the following properties:
- Right side of each FD is a single attribute
- No FD in Fc is redundant (can be derived from others)
- Left side of each FD is minimal (no extraneous attributes)
Algorithm for Computing Canonical Cover:
Fc = F
// Step 1: Decompose right sides
Replace each X → YZ with X → Y and X → Z
// Step 2: Remove extraneous attributes from left sides
for each FD X → A in Fc do
for each attribute B in X do
compute (X - {B})⁺ using current Fc
if A in (X - {B})⁺ then
replace X → A with (X - {B}) → A
end if
end for
end for
// Step 3: Remove redundant FDs
for each FD f in Fc do
compute closure using (Fc - {f})
if f can be derived then
remove f from Fc
end if
end for
return Fc
Example:
Given F = {A → BC, B → C, A → B, AB → C}
Step 1: Decompose right sides
F = {A → B, A → C, B → C, A → B, AB → C}
Remove duplicates: {A → B, A → C, B → C, AB → C}
Step 2: Remove extraneous attributes
Check AB → C:
- Compute (A)⁺ using {A → B, A → C, B → C} = {A, B, C}
- Since C ∈ A⁺, B is extraneous in AB → C
- Replace AB → C with A → C; since A → C already exists, drop AB → C
Now Fc = {A → B, A → C, B → C}
Step 3: Remove redundant FDs
Check A → C: compute the closure of {A} using {A → B, B → C}: {A}⁺ = {A, B, C}
Since C is in the closure, A → C is redundant → remove it
Final canonical cover: Fc = {A → B, B → C}
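The three steps can be coded almost verbatim. A Python sketch of the algorithm (helper names are illustrative) that reproduces the example's result:

```python
def closure(attrs, fds):
    """Attribute closure, as defined earlier in the chapter."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def canonical_cover(raw_fds):
    """Minimal cover: single-attribute right sides, no extraneous
    left-side attributes, no redundant FDs."""
    # Step 1: decompose right sides and deduplicate
    fc = list(dict.fromkeys(
        (frozenset(l), frozenset(a)) for l, r in raw_fds for a in r))
    changed = True
    while changed:
        changed = False
        # Step 2: drop extraneous left-side attributes
        for i, (lhs, rhs) in enumerate(fc):
            for b in lhs:
                smaller = lhs - {b}
                if smaller and rhs <= closure(smaller, fc):
                    fc[i] = (frozenset(smaller), rhs)
                    changed = True
                    break
        fc = list(dict.fromkeys(fc))  # re-deduplicate after rewriting
        # Step 3: drop redundant FDs
        for i, (lhs, rhs) in enumerate(fc):
            if rhs <= closure(lhs, fc[:i] + fc[i + 1:]):
                fc.pop(i)
                changed = True
                break
    return {("".join(sorted(l)), "".join(sorted(r))) for l, r in fc}

F = [("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")]
print(sorted(canonical_cover(F)))  # [('A', 'B'), ('B', 'C')]
```

Steps 2 and 3 repeat until nothing changes, because removing one FD can make another attribute or FD newly removable.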
A decomposition is dependency-preserving if all FDs in the original set F can be checked on the individual decomposed relations without performing joins.
Definition:
Let R be a relation with FD set F, and let R₁, R₂, ..., Rₙ be a decomposition of R. The projection of F onto Rᵢ (denoted πRᵢ(F)) is the set of all X → Y in F⁺ such that XY ⊆ Rᵢ.
The decomposition is dependency-preserving if the union of all πRᵢ(F) logically implies all FDs in F.
Algorithm for Testing Dependency Preservation:
Compute canonical cover Fc of F
for each FD X → Y in Fc do
result = X
while changes to result do
for each decomposed relation Rᵢ do
t = (result ∩ Rᵢ)⁺ ∩ Rᵢ
result = result ∪ t
end for
end while
if Y ⊈ result then
return false // FD not preserved
end if
end for
return true // All FDs preserved
Example:
Given R = (A, B, C, D) and F = {A → B, B → C, C → D}
Decompose into R₁ = (A, B), R₂ = (B, C), R₃ = (C, D)
Check preservation:
For A → B:
- result = {A}
- Using R₁: (A ∩ R₁)⁺ = A⁺ = {A, B, C, D}
- (A⁺ ∩ R₁) = {A, B}
- result = {A, B}
- B in result → preserved
For B → C:
- result = {B}
- Using R₂: (B ∩ R₂)⁺ = B⁺ = {B, C, D}
- (B⁺ ∩ R₂) = {B, C}
- result = {B, C}
- C in result → preserved
For C → D:
- result = {C}
- Using R₃: (C ∩ R₃)⁺ = C⁺ = {C, D}
- (C⁺ ∩ R₃) = {C, D}
- result = {C, D}
- D in result → preserved
All FDs preserved → decomposition is dependency-preserving.
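The preservation test is mechanical enough to script. A Python sketch (illustrative names) that replays the preserved example and also a decomposition that loses an FD:

```python
def closure(attrs, fds):
    """Attribute closure under a set of (left, right) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserves_dependencies(fds, decomposition):
    """For each FD, grow `result` using only attributes visible inside
    one fragment at a time (the restriction algorithm)."""
    for lhs, rhs in fds:
        result = set(lhs)
        changed = True
        while changed:
            changed = False
            for Ri in decomposition:
                t = closure(result & set(Ri), fds) & set(Ri)
                if not t <= result:
                    result |= t
                    changed = True
        if not set(rhs) <= result:
            return False  # this FD cannot be checked without a join
    return True

F = [("A", "B"), ("B", "C"), ("C", "D")]
print(preserves_dependencies(F, ["AB", "BC", "CD"]))  # True
print(preserves_dependencies(F, ["AB", "AC", "AD"]))  # False: B -> C is lost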
Database anomalies are problems that can occur in poorly designed databases, leading to data inconsistency and redundancy.
Insertion Anomalies:
Occur when certain facts cannot be recorded without recording other unrelated facts.
Example table: Student_Course (student_id, student_name, course_id, course_title, instructor, grade)
Problems:
- Cannot add a new course until a student enrolls in it
- Cannot add a new student until they enroll in at least one course
- Adding a new course requires repeating course information for each enrollment
Deletion Anomalies:
Occur when deleting facts about one subject unintentionally deletes facts about another subject.
Example: If we delete the last student enrolled in a course, we lose all information about that course (title, instructor, etc.)
Update Anomalies:
Occur when updating a fact requires multiple changes, leading to inconsistency if some changes are missed.
Example: If a course title changes, we must update every row where that course appears. Failure to update all rows creates inconsistent data.
A relation is in 1NF if every attribute contains only atomic (indivisible) values.
Violations of 1NF:
- Multivalued attributes: An attribute that can have multiple values
- Composite attributes: Attributes that contain structured data
- Repeating groups: Multiple attributes storing similar data (phone1, phone2, phone3)
Converting to 1NF:
Original (not in 1NF):
Student (student_id, student_name, phones)
Data: (101, 'Alice', '555-1234, 555-5678')
Option 1: Flatten into multiple rows
Student (student_id, student_name, phone)
(101, 'Alice', '555-1234')
(101, 'Alice', '555-5678')
Option 2: Separate into two tables
Student (student_id, student_name)
Student_Phone (student_id, phone)
Example - Books and Authors:
Not in 1NF:
Book (ISBN, title, authors)
(0-123-45678-9, 'Database Design', 'Codd, Date, Ullman')
Convert to 1NF:
Book_Author (ISBN, title, author)
(0-123-45678-9, 'Database Design', 'Codd')
(0-123-45678-9, 'Database Design', 'Date')
(0-123-45678-9, 'Database Design', 'Ullman')
Or better:
Book (ISBN, title)
Author (author_id, author_name)
Book_Author (ISBN, author_id)
A relation is in 2NF if it is in 1NF and every non-key attribute is fully functionally dependent on the entire primary key (no partial dependencies).
Partial Dependency: When a non-key attribute depends on only part of a composite primary key.
Example Violating 2NF:
Enrollment (student_id, course_id, student_name, course_title, grade)
Primary key: (student_id, course_id)
Dependencies:
- student_id, course_id → grade (full dependency)
- student_id → student_name (partial dependency)
- course_id → course_title (partial dependency)
This violates 2NF because student_name depends on only part of the key (student_id), and course_title depends on only part (course_id).
Converting to 2NF:
Decompose into:
Student (student_id, student_name)
Course (course_id, course_title)
Enrollment (student_id, course_id, grade)
Now all non-key attributes depend on the entire primary key in each relation.
Another Example - Order Items:
Order_Item (order_id, product_id, order_date, customer_name, product_name, quantity, price)
Primary key: (order_id, product_id)
Dependencies:
- order_id, product_id → quantity, price
- order_id → order_date, customer_name
- product_id → product_name
Partial dependencies exist → violates 2NF
Solution:
Order (order_id, order_date, customer_name)
Product (product_id, product_name)
Order_Detail (order_id, product_id, quantity, price)
A relation is in 3NF if it is in 2NF and no non-key attribute is transitively dependent on the primary key.
Transitive Dependency: When a non-key attribute depends on another non-key attribute.
Example Violating 3NF:
Employee (emp_id, emp_name, dept_id, dept_name, dept_location)
Primary key: emp_id
Dependencies:
- emp_id → dept_id
- dept_id → dept_name, dept_location (transitive dependency)
dept_name and dept_location depend on dept_id, not directly on emp_id.
Converting to 3NF:
Decompose into:
Employee (emp_id, emp_name, dept_id)
Department (dept_id, dept_name, dept_location)
Another Example - Student Advising:
Student (student_id, student_name, advisor_id, advisor_name, advisor_office)
Primary key: student_id
Dependencies:
- student_id → advisor_id
- advisor_id → advisor_name, advisor_office (transitive)
Solution:
Student (student_id, student_name, advisor_id)
Advisor (advisor_id, advisor_name, advisor_office)
A relation is in BCNF if for every non-trivial functional dependency X → Y, X is a superkey.
BCNF is stricter than 3NF—every BCNF relation is in 3NF, but not vice versa.
Example Violating BCNF (but in 3NF):
Student_Advisor (student_id, major, advisor)
Assume:
- Each student has one major and one advisor
- Each advisor advises students in only one major
- A major may have multiple advisors
FDs:
1. student_id → major, advisor
2. advisor → major
Candidate key: student_id
Check BCNF:
- FD1: student_id → major, advisor: student_id is key → OK
- FD2: advisor → major: advisor is not a superkey → violates BCNF
This relation is in 3NF but not BCNF.
Converting to BCNF:
Decompose based on violating FD advisor → major:
Student (student_id, advisor)
Advisor_Major (advisor, major)
Now both relations satisfy BCNF:
- In Student, only FD is student_id → advisor (key determines non-key)
- In Advisor_Major, only FD is advisor → major (key determines non-key)
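Checking BCNF reduces to one closure test per non-trivial FD. A Python sketch (attributes abbreviated to single letters; helper names are illustrative):

```python
def closure(attrs, fds):
    """Attribute closure under a set of (left, right) FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_violations(R, fds):
    """Non-trivial FDs whose left side is not a superkey of R."""
    R = set(R)
    return [(l, r) for l, r in fds
            if not set(r) <= set(l)        # non-trivial
            and closure(l, fds) != R]      # left side is not a superkey

# Student_Advisor(student_id=S, major=M, advisor=A): S -> MA, A -> M
F = [("S", "MA"), ("A", "M")]
print(bcnf_violations("SMA", F))  # [('A', 'M')]: advisor -> major violates BCNF
# After decomposing on the violating FD, each fragment is clean:
print(bcnf_violations("SA", [("S", "A")]))  # []
print(bcnf_violations("AM", [("A", "M")]))  # []
```

An empty violation list means the relation is in BCNF; any FD returned is a candidate to decompose on, exactly as done above.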
Another BCNF Example:
Enrollment (student_id, course_id, instructor, textbook)
Assume:
- Each course has one instructor
- Each instructor uses one textbook for a given course
- Students can take any course with any instructor
FDs:
- course_id → instructor
- instructor, course_id → textbook
Candidate keys: (student_id, course_id)
Check BCNF:
- course_id → instructor: course_id is not a superkey → violates BCNF
Decompose:
Course_Instructor (course_id, instructor)
Enrollment_Textbook (student_id, course_id, textbook)
But the FD instructor, course_id → textbook is lost: instructor does not even appear in Enrollment_Textbook, so the dependency cannot be checked there. A better decomposition observes that course_id → instructor and (instructor, course_id) → textbook together imply course_id → textbook, giving:
Course (course_id, instructor, textbook)
Enrollment (student_id, course_id)
Both relations are now in BCNF.
4NF addresses multivalued dependencies (MVDs), which occur when independent facts are repeated in a relation.
Multivalued Dependency (MVD):
X →→ Y (read as "X multidetermines Y") means that given a value of X, the set of Y values is independent of other attributes.
Formally: For any two tuples t1 and t2 with t1[X] = t2[X], there exist tuples t3 and t4 such that:
- t3[X] = t1[X], t3[Y] = t1[Y], t3[Z] = t2[Z]
- t4[X] = t1[X], t4[Y] = t2[Y], t4[Z] = t1[Z] where Z = R - (X ∪ Y)
Example - Student Courses and Hobbies:
Student_Hobby_Course (student_id, hobby, course)
student_id →→ hobby
student_id →→ course
Sample data:
student_id | hobby | course
-----------|----------|--------
101 | swimming | CS101
101 | swimming | MATH201
101 | reading | CS101
101 | reading | MATH201
This relation has redundancy because hobbies and courses are independent—every combination must appear.
4NF Definition:
A relation is in 4NF if for every nontrivial MVD X →→ Y, X is a superkey.
Converting to 4NF:
Decompose based on MVDs:
Student_Hobby (student_id, hobby)
Student_Course (student_id, course)
Now each relation contains only one independent fact per student.
5NF (also called Project-Join Normal Form) deals with join dependencies (JDs). A relation is in 5NF if every join dependency in the relation is implied by the candidate keys.
Join Dependency (JD):
A join dependency JD(R₁, R₂, ..., Rₙ) means that R is equal to the natural join of its projections onto R₁, R₂, ..., Rₙ.
Example - Suppliers, Parts, Projects:
Supply (supplier_id, part_id, project_id)
Assume business rule: A supplier supplies a part for a project if and only if:
- The supplier supplies that part (for some project)
- The supplier supplies some part for that project
- Some supplier supplies that part for that project
This creates a join dependency: JD(* [supplier_id, part_id],
[supplier_id, project_id],
[part_id, project_id])
The relation can be reconstructed by joining three binary relations, so it should be decomposed.
Converting to 5NF:
Supplier_Part (supplier_id, part_id)
Supplier_Project (supplier_id, project_id)
Part_Project (part_id, project_id)
Formal Definition:
Let R be a relation schema with attribute sets X, Y, and Z = R - (X ∪ Y). The multivalued dependency X →→ Y holds if for any two tuples t1 and t2 in any instance of R with t1[X] = t2[X], there exist tuples t3 and t4 in the instance such that:
- t3[X] = t1[X]
- t3[Y] = t1[Y]
- t3[Z] = t2[Z]
- t4[X] = t1[X]
- t4[Y] = t2[Y]
- t4[Z] = t1[Z]
Properties of MVDs:
- Complementation: If X →→ Y, then X →→ (R - X - Y)
- Reflexivity: If Y ⊆ X, then X →→ Y
- Augmentation: If X →→ Y and V ⊆ W, then XW →→ YV
- Transitivity: If X →→ Y and Y →→ Z, then X →→ (Z - Y)
- Replication: Every FD is an MVD (X → Y implies X →→ Y)
Example - Employee Skills and Languages:
Employee_Skill_Lang (emp_id, skill, language)
emp_id →→ skill
emp_id →→ language
Data:
emp_id | skill | language
-------|-----------|----------
101 | Java | English
101 | Java | Spanish
101 | Python | English
101 | Python | Spanish
This MVD creates a Cartesian product of skills and languages for each employee.
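The tuple-swap condition in the formal definition is testable on a concrete instance. A Python sketch of the check (the `mvd_holds` helper is illustrative), run against the data above:

```python
def mvd_holds(rows, attrs, X, Y):
    """Check X ->> Y: for every pair of tuples agreeing on X, the
    Y-values and Z-values (Z = remaining attributes) must combine freely."""
    Z = [a for a in attrs if a not in X and a not in Y]
    idx = {a: i for i, a in enumerate(attrs)}
    proj = lambda t, A: tuple(t[idx[a]] for a in A)
    rowset = set(rows)
    for t1 in rows:
        for t2 in rows:
            if proj(t1, X) == proj(t2, X):
                # required tuple t3: X and Y values from t1, Z values from t2
                t3 = list(t1)
                for a in Z:
                    t3[idx[a]] = t2[idx[a]]
                if tuple(t3) not in rowset:
                    return False
    return True

attrs = ("emp_id", "skill", "language")
rows = [
    (101, "Java",   "English"),
    (101, "Java",   "Spanish"),
    (101, "Python", "English"),
    (101, "Python", "Spanish"),
]
print(mvd_holds(rows, attrs, ["emp_id"], ["skill"]))      # True
print(mvd_holds(rows[:3], attrs, ["emp_id"], ["skill"]))  # False: (Python, Spanish) missing
```

The failing case shows why the MVD forces the Cartesian product: dropping any one combination breaks the independence of skills and languages.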
Definition:
A join dependency JD(R₁, R₂, ..., Rₙ) holds for relation R if R is equal to the natural join of its projections onto R₁, R₂, ..., Rₙ.
Example - Agency Database:
Consider a relation representing which agents represent which clients for which products:
Representation (agent_id, client_id, product_id)
Business rule: An agent represents a client for a product if and only if:
- The agent represents the client (for some product)
- The agent sells that product (to some client)
- The client uses that product (with some agent)
This creates a 3-way join dependency.
Check if decomposition is lossless:
Representation = π[agent_id, client_id](R) ⋈ π[agent_id, product_id](R) ⋈ π[client_id, product_id](R)
If this holds, the relation should be decomposed.
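The losslessness check can be carried out by actually projecting and re-joining. A Python sketch (illustrative helpers; the sample rows are hypothetical):

```python
def project(rows, attrs, comp):
    """Projection of tuple set `rows` (over `attrs`) onto `comp`."""
    idx = [attrs.index(a) for a in comp]
    return {tuple(r[i] for i in idx) for r in rows}

def natural_join(r1, a1, r2, a2):
    """Natural join of two tuple sets with attribute lists a1, a2."""
    common = [c for c in a1 if c in a2]
    extra = [a for a in a2 if a not in a1]
    out = set()
    for t1 in r1:
        for t2 in r2:
            if all(t1[a1.index(c)] == t2[a2.index(c)] for c in common):
                out.add(t1 + tuple(t2[a2.index(a)] for a in extra))
    return out, a1 + extra

def jd_holds(rows, attrs, components):
    """True iff the relation equals the natural join of its projections."""
    joined, jattrs = project(rows, attrs, components[0]), components[0]
    for comp in components[1:]:
        joined, jattrs = natural_join(joined, jattrs,
                                      project(rows, attrs, comp), comp)
    # put joined tuples back into the original attribute order
    reordered = {tuple(t[jattrs.index(a)] for a in attrs) for t in joined}
    return reordered == set(rows)

attrs = ["agent_id", "client_id", "product_id"]
comps = [["agent_id", "client_id"],
         ["agent_id", "product_id"],
         ["client_id", "product_id"]]
rows = [("a1", "c1", "p2"), ("a1", "c2", "p1"), ("a2", "c1", "p1")]
print(jd_holds(rows, attrs, comps))  # False: the join manufactures (a1, c1, p1)
print(jd_holds(rows + [("a1", "c1", "p1")], attrs, comps))  # True
```

The `False` case is the classic connection-trap instance: joining the three binary projections produces a tuple the original relation never contained, so the JD does not hold and the decomposition would be lossy.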
MVD vs JD:
- An MVD is a binary join dependency (JD with two components)
- JDs generalize MVDs to multiple components
Denormalization is the intentional introduction of redundancy to improve query performance.
When to Denormalize:
- Performance requirements: Queries must be faster than normalized design allows
- Read-heavy workloads: OLAP systems with few updates
- Complex joins: Frequently joined tables causing performance issues
- Reporting requirements: Pre-aggregated data needed for dashboards
Denormalization Techniques:
1. Pre-joined Tables:
Store frequently joined data together.
Normalized:
Order (order_id, customer_id, order_date)
Customer (customer_id, customer_name, customer_address)
Order_Item (order_id, product_id, quantity)
Product (product_id, product_name, price)
Denormalized:
Order_Fact (order_id, customer_name, customer_address,
order_date, product_name, quantity, price)
2. Derived Columns:
Store computed values to avoid recalculation.
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
subtotal DECIMAL(10,2),
tax DECIMAL(10,2),
shipping DECIMAL(10,2),
total DECIMAL(10,2) GENERATED ALWAYS AS
(subtotal + tax + shipping) STORED
);
3. Pre-aggregated Tables:
Create summary tables for reporting.
CREATE TABLE daily_sales_summary AS
SELECT
DATE(order_date) AS sale_date,
product_id,
COUNT(*) AS order_count,
SUM(quantity) AS total_quantity,
SUM(amount) AS total_amount
FROM orders
GROUP BY DATE(order_date), product_id;
4. Vertical Partitioning:
Split tables by column groups with different access patterns.
-- Frequently accessed columns
CREATE TABLE employee_fast (
emp_id INT PRIMARY KEY,
emp_name VARCHAR(100),
department VARCHAR(50),
salary DECIMAL(10,2)
);
-- Infrequently accessed columns
CREATE TABLE employee_slow (
emp_id INT PRIMARY KEY,
emergency_contact VARCHAR(100),
health_insurance VARCHAR(50),
notes TEXT
);
5. Horizontal Partitioning (Sharding):
Split tables by row groups.
-- Partition by date range
CREATE TABLE orders_2023 PARTITION OF orders
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE orders_2024 PARTITION OF orders
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
6. Materialized Views:
Pre-computed query results maintained automatically.
CREATE MATERIALIZED VIEW customer_order_summary
REFRESH COMPLETE ON DEMAND
AS
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) AS order_count,
SUM(o.total) AS lifetime_value,
MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name;
Risks of Denormalization:
- Update anomalies: Redundant data must be kept consistent
- Increased storage: Redundancy consumes more space
- Complex maintenance: More complex update logic
- Data inconsistency: Higher risk of conflicting values
- Harder to evolve: Schema changes more complex
Denormalization Checklist:
- Have you optimized normalized design first?
- Is the performance gain worth the redundancy cost?
- Can you automate consistency maintenance?
- Have you documented denormalization decisions?
- Is there a plan to handle updates to redundant data?
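The checklist's "automate consistency maintenance" point can be illustrated with a trigger that keeps a pre-aggregated summary table in sync with its base table. A minimal sketch using Python's sqlite3 module (the table, column, and trigger names are illustrative, not from the case studies above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id   INTEGER PRIMARY KEY,
    product_id INTEGER,
    quantity   INTEGER,
    amount     REAL
);
CREATE TABLE sales_summary (
    product_id     INTEGER PRIMARY KEY,
    order_count    INTEGER NOT NULL,
    total_quantity INTEGER NOT NULL
);
-- The trigger maintains the denormalized summary on every insert,
-- so application code never updates it by hand.
CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
BEGIN
    INSERT OR IGNORE INTO sales_summary VALUES (NEW.product_id, 0, 0);
    UPDATE sales_summary
    SET order_count = order_count + 1,
        total_quantity = total_quantity + NEW.quantity
    WHERE product_id = NEW.product_id;
END;
""")
conn.execute("INSERT INTO orders VALUES (1, 10, 2, 19.98)")
conn.execute("INSERT INTO orders VALUES (2, 10, 3, 29.97)")
row = conn.execute(
    "SELECT order_count, total_quantity FROM sales_summary WHERE product_id = 10"
).fetchone()
print(row)  # (2, 5)
```

Because the trigger runs in the same transaction as the insert, the summary can never drift from the base table, which is exactly the consistency guarantee the checklist asks for.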
Top-down design starts with high-level business requirements and progressively refines them into detailed database structures.
Phases of Top-Down Design:
Phase 1: Requirements Analysis
- Interview stakeholders
- Review existing systems and documents
- Identify business processes and rules
- Document data requirements
- Create use cases and data flow diagrams
Phase 2: Conceptual Design
- Identify entities and relationships
- Create ER/EER diagrams
- Define business rules and constraints
- Validate with users
- Deliverable: Conceptual data model
Phase 3: Logical Design
- Map ER model to relational schema
- Normalize relations
- Define integrity constraints
- Create views for user access
- Deliverable: Logical database schema
Phase 4: Physical Design
- Choose storage structures
- Design indexes
- Plan partitioning
- Estimate sizes
- Create DDL scripts
- Deliverable: Physical database implementation
Phase 5: Implementation and Testing
- Create database objects
- Load test data
- Test queries and transactions
- Validate performance
- Document the system
Advantages of Top-Down:
- Focuses on business needs
- Ensures comprehensive requirements coverage
- Produces well-documented design
- Easier to validate with users
Disadvantages:
- Time-consuming
- May miss existing data sources
- Can be too abstract for implementers
Bottom-up design starts with existing data sources and integrates them into a coherent database structure.
Phases of Bottom-Up Design:
Phase 1: Source Analysis
- Inventory existing data sources
- Analyze file formats and structures
- Document current data usage
- Identify data quality issues
Phase 2: View Integration
- Identify common data elements
- Resolve naming conflicts
- Resolve structural conflicts
- Resolve constraint conflicts
Phase 3: Schema Integration
- Create global schema
- Define mappings from sources
- Identify and resolve redundancies
- Normalize integrated schema
Phase 4: Reverse Engineering
- Extract existing database schemas
- Document implicit assumptions
- Identify hidden constraints
- Understand legacy design decisions
Advantages of Bottom-Up:
- Leverages existing data assets
- Faster initial results
- Practical and grounded in reality
- Good for data warehouse design
Disadvantages:
- May perpetuate bad designs
- Can miss business requirements
- Difficult to achieve integration
- May lack overall coherence
View integration combines multiple user views into a unified conceptual schema.
Steps in View Integration:
Step 1: Pre-Integration Analysis
- Review each view independently
- Document view characteristics
- Identify potential conflicts
Step 2: Conflict Resolution
Naming Conflicts:
- Synonyms: Same concept, different names
- Homonyms: Different concepts, same name
Example:
- View1: "Customer" means person who buys
- View2: "Customer" means company that buys → homonym
Structural Conflicts:
- Type conflicts: Same concept, different representations
- Domain conflicts: Different value ranges
- Constraint conflicts: Different business rules
Example:
- View1: Employee has single phone number (attribute)
- View2: Employee has multiple phones (repeating group)
Step 3: Merging Views
- Identify common concepts
- Create superclass/subclass hierarchies
- Add relationships between views
- Resolve redundancies
Step 4: Schema Restructuring
- Optimize integrated schema
- Apply normalization
- Add integrity constraints
- Validate with stakeholders
View Integration Example:
View1 (Registrar): Student(student_id, name, dob, address)
View2 (Financial Aid): Student(ssn, name, income, aid_amount)
View3 (Housing): Resident(student_id, room, building)
Integration:
- Identify student_id and ssn as different identifiers for same entity
- Create unified Student entity with both identifiers
- Add relationships to Housing
- Resolve naming: use consistent attribute names
Schema refinement improves an initial database design to meet quality goals.
Quality Criteria:
- Completeness: All required data represented
- Correctness: Accurately models real world
- Minimality: No redundant data
- Clarity: Easy to understand
- Flexibility: Accommodates change
- Performance: Supports required workloads
Refinement Process:
Step 1: Identify Redundancies
- Look for repeating data
- Check for derived data that can be computed
- Identify unnecessary duplications
Step 2: Apply Normalization
- Check for 1NF violations
- Remove partial dependencies (2NF)
- Remove transitive dependencies (3NF)
- Consider BCNF, 4NF for special cases
Step 3: Validate Against Requirements
- Can all required queries be expressed?
- Are all business rules enforced?
- Are there missing relationships?
Step 4: Consider Performance
- Identify frequent query patterns
- Consider denormalization for performance
- Plan indexing strategy
Step 5: Review Integrity Constraints
- Add missing constraints
- Ensure referential integrity
- Define business rule constraints
Case Study 1: Banking System
Requirements:
- Manage customer accounts (checking, savings)
- Process transactions (deposits, withdrawals, transfers)
- Track account balances
- Generate monthly statements
- Handle multiple account owners
- Support branch network
Conceptual Design:
Entities:
- Customer (customer_id, name, ssn, address, phone, email)
- Account (account_id, type, balance, interest_rate, open_date)
- Branch (branch_id, name, address, phone)
- Transaction (transaction_id, type, amount, date, description)
Relationships:
- Customer owns Account (M:N with joint ownership)
- Account belongs to Branch (N:1)
- Account has Transaction (1:N)
Logical Design:
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
ssn CHAR(11) UNIQUE NOT NULL,
name VARCHAR(100) NOT NULL,
address TEXT,
phone VARCHAR(20),
email VARCHAR(100),
created_date DATE DEFAULT CURRENT_DATE
);
CREATE TABLE branches (
branch_id INTEGER PRIMARY KEY,
branch_name VARCHAR(100) NOT NULL,
address TEXT,
phone VARCHAR(20)
);
CREATE TABLE accounts (
account_id INTEGER PRIMARY KEY,
account_type VARCHAR(20) CHECK (account_type IN ('checking', 'savings', 'money_market')),
balance DECIMAL(15,2) DEFAULT 0.00,
interest_rate DECIMAL(5,4),
open_date DATE DEFAULT CURRENT_DATE,
branch_id INTEGER REFERENCES branches(branch_id)
);
CREATE TABLE account_owners (
customer_id INTEGER REFERENCES customers(customer_id),
account_id INTEGER REFERENCES accounts(account_id),
ownership_percentage DECIMAL(5,2) DEFAULT 100.00,
PRIMARY KEY (customer_id, account_id)
);
CREATE TABLE transactions (
transaction_id INTEGER PRIMARY KEY,
account_id INTEGER REFERENCES accounts(account_id),
transaction_type VARCHAR(20) CHECK (transaction_type IN ('deposit', 'withdrawal', 'transfer', 'fee', 'interest')),
amount DECIMAL(15,2) NOT NULL,
balance_after DECIMAL(15,2) NOT NULL,
transaction_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
description VARCHAR(200),
related_account_id INTEGER REFERENCES accounts(account_id) -- for transfers
);
Physical Design:
-- Indexes for performance
CREATE INDEX idx_transactions_account ON transactions(account_id);
CREATE INDEX idx_transactions_date ON transactions(transaction_date);
CREATE INDEX idx_account_owners_customer ON account_owners(customer_id);
-- Partition transactions by date
CREATE TABLE transactions_2023 PARTITION OF transactions
FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
-- Materialized view for daily balances
CREATE MATERIALIZED VIEW daily_balances AS
SELECT DISTINCT
account_id,
DATE(transaction_date) AS balance_date,
LAST_VALUE(balance_after) OVER (
PARTITION BY account_id, DATE(transaction_date)
ORDER BY transaction_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS end_of_day_balance
FROM transactions;
Case Study 2: E-Commerce Platform
Requirements:
- Product catalog with categories
- Shopping cart functionality
- Order processing
- Customer management
- Inventory tracking
- Reviews and ratings
Conceptual Design:
Entities:
- Customer (customer_id, email, password, name, addresses)
- Product (product_id, name, description, price, stock)
- Category (category_id, name, description)
- Order (order_id, order_date, status, total)
- OrderItem (quantity, price_at_time)
- Review (rating, comment, date)
- Cart (created_date)
Relationships:
- Product belongs to Category (M:N)
- Customer places Order (1:N)
- Order contains Product (M:N through OrderItem)
- Customer writes Review for Product (at most one review per customer-product pair)
- Customer has Cart (1:1)
Logical Design:
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
email VARCHAR(200) UNIQUE NOT NULL,
password_hash CHAR(60) NOT NULL,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_login TIMESTAMP
);
CREATE TABLE addresses (
address_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
address_type VARCHAR(20) CHECK (address_type IN ('shipping', 'billing')),
street VARCHAR(200) NOT NULL,
city VARCHAR(100) NOT NULL,
state VARCHAR(50),
postal_code VARCHAR(20),
country VARCHAR(100) NOT NULL,
is_default BOOLEAN DEFAULT FALSE
);
CREATE TABLE categories (
category_id INTEGER PRIMARY KEY,
category_name VARCHAR(100) NOT NULL,
description TEXT,
parent_category_id INTEGER REFERENCES categories(category_id)
);
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(200) NOT NULL,
description TEXT,
price DECIMAL(10,2) NOT NULL CHECK (price >= 0),
stock_quantity INTEGER NOT NULL DEFAULT 0,
sku VARCHAR(50) UNIQUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP
);
CREATE TABLE product_categories (
product_id INTEGER REFERENCES products(product_id),
category_id INTEGER REFERENCES categories(category_id),
PRIMARY KEY (product_id, category_id)
);
CREATE TABLE carts (
cart_id INTEGER PRIMARY KEY,
customer_id INTEGER UNIQUE REFERENCES customers(customer_id),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP
);
CREATE TABLE cart_items (
cart_id INTEGER REFERENCES carts(cart_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
added_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (cart_id, product_id)
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
status VARCHAR(20) CHECK (status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled')),
shipping_address_id INTEGER REFERENCES addresses(address_id),
billing_address_id INTEGER REFERENCES addresses(address_id),
subtotal DECIMAL(10,2) NOT NULL,
tax DECIMAL(10,2) NOT NULL,
shipping_cost DECIMAL(10,2) NOT NULL,
total DECIMAL(10,2) NOT NULL,
payment_method VARCHAR(50),
payment_status VARCHAR(20)
);
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL,
price_at_time DECIMAL(10,2) NOT NULL, -- Snapshot of product price
PRIMARY KEY (order_id, product_id)
);
CREATE TABLE reviews (
review_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
product_id INTEGER REFERENCES products(product_id),
rating INTEGER CHECK (rating BETWEEN 1 AND 5),
comment TEXT,
review_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(customer_id, product_id) -- One review per customer per product
);
Physical Design with Performance Optimizations:
-- Full-text search for products
CREATE INDEX idx_products_search ON products USING GIN (to_tsvector('english', product_name || ' ' || description));
-- Partial index for active carts
CREATE INDEX idx_active_carts ON carts(customer_id)
WHERE updated_at > CURRENT_TIMESTAMP - INTERVAL '7 days';
-- Composite index for order history
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date DESC);
-- Inventory alerts
CREATE VIEW low_stock_products AS
SELECT product_id, product_name, stock_quantity
FROM products
WHERE stock_quantity < 10;
Case Study 3: Healthcare System
Requirements:
- Patient records management
- Appointment scheduling
- Medical history tracking
- Prescription management
- Insurance billing
- HIPAA compliance
Conceptual Design:
Entities:
- Patient (patient_id, name, dob, contact, insurance)
- Provider (provider_id, name, specialty, credentials)
- Appointment (appointment_id, date_time, status, reason)
- Diagnosis (diagnosis_code, description, category)
- Medication (medication_code, name, dosage_form, manufacturer)
- Prescription (prescribed_date, dosage, instructions, refills)
- Encounter (encounter_id, date, type, notes)
- InsuranceClaim (claim_id, amount, status, date)
Relationships:
- Patient schedules Appointment with Provider
- Provider diagnoses Patient (through Encounter)
- Provider prescribes Medication to Patient
- Encounter documents Patient-Provider interaction
- InsuranceClaim covers Encounter
Logical Design (with HIPAA considerations):
-- Separate PHI (Protected Health Information) for security
CREATE TABLE patients_demographics (
patient_id INTEGER PRIMARY KEY,
mrn VARCHAR(20) UNIQUE NOT NULL, -- Medical Record Number
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
date_of_birth DATE NOT NULL,
gender CHAR(1) CHECK (gender IN ('M', 'F', 'O')),
ssn_last4 CHAR(4), -- Only last 4 digits for security
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Sensitive data encrypted at application level
CREATE TABLE patients_contact (
patient_id INTEGER PRIMARY KEY REFERENCES patients_demographics(patient_id),
address_encrypted TEXT NOT NULL, -- Encrypted
phone_encrypted VARCHAR(200) NOT NULL, -- Encrypted
email_encrypted VARCHAR(200), -- Encrypted
emergency_contact_encrypted TEXT -- Encrypted
);
CREATE TABLE patients_insurance (
insurance_id INTEGER PRIMARY KEY,
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_name VARCHAR(200) NOT NULL,
policy_number_encrypted VARCHAR(200) NOT NULL, -- Encrypted
group_number_encrypted VARCHAR(200), -- Encrypted
coverage_start_date DATE,
coverage_end_date DATE
);
CREATE TABLE providers (
provider_id INTEGER PRIMARY KEY,
npi_number CHAR(10) UNIQUE NOT NULL, -- National Provider Identifier
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
specialty VARCHAR(100) NOT NULL,
credentials VARCHAR(100),
department VARCHAR(100)
);
CREATE TABLE appointments (
appointment_id INTEGER PRIMARY KEY,
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_id INTEGER REFERENCES providers(provider_id),
scheduled_time TIMESTAMP NOT NULL,
duration_minutes INTEGER DEFAULT 30,
status VARCHAR(20) CHECK (status IN ('scheduled', 'completed', 'cancelled', 'no-show')),
reason VARCHAR(500),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Inline INDEX clauses are MySQL-specific; standard SQL creates indexes separately
CREATE INDEX idx_appointments_patient ON appointments(patient_id);
CREATE INDEX idx_appointments_provider ON appointments(provider_id);
CREATE INDEX idx_appointments_datetime ON appointments(scheduled_time);
CREATE TABLE encounters (
encounter_id INTEGER PRIMARY KEY,
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_id INTEGER REFERENCES providers(provider_id),
appointment_id INTEGER REFERENCES appointments(appointment_id),
encounter_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
encounter_type VARCHAR(50) CHECK (encounter_type IN ('office', 'telehealth', 'emergency', 'inpatient')),
chief_complaint TEXT,
assessment_notes TEXT, -- May contain PHI
treatment_plan TEXT,
followup_required BOOLEAN DEFAULT FALSE
);
CREATE TABLE diagnoses (
diagnosis_id INTEGER PRIMARY KEY,
icd10_code VARCHAR(10) NOT NULL, -- ICD-10 diagnosis code
description TEXT NOT NULL,
category VARCHAR(100)
);
CREATE TABLE encounter_diagnoses (
encounter_id INTEGER REFERENCES encounters(encounter_id),
diagnosis_id INTEGER REFERENCES diagnoses(diagnosis_id),
is_primary BOOLEAN DEFAULT FALSE,
notes TEXT,
PRIMARY KEY (encounter_id, diagnosis_id)
);
CREATE TABLE medications (
medication_id INTEGER PRIMARY KEY,
ndc_code VARCHAR(20) UNIQUE NOT NULL, -- National Drug Code
medication_name VARCHAR(200) NOT NULL,
generic_name VARCHAR(200),
dosage_form VARCHAR(50),
manufacturer VARCHAR(200),
requires_prescription BOOLEAN DEFAULT TRUE
);
CREATE TABLE prescriptions (
prescription_id INTEGER PRIMARY KEY,
encounter_id INTEGER REFERENCES encounters(encounter_id),
patient_id INTEGER REFERENCES patients_demographics(patient_id),
provider_id INTEGER REFERENCES providers(provider_id),
medication_id INTEGER REFERENCES medications(medication_id),
prescribed_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
dosage_instructions TEXT NOT NULL,
quantity DECIMAL(10,2) NOT NULL,
refills_authorized INTEGER DEFAULT 0,
dispense_as_written BOOLEAN DEFAULT FALSE,
status VARCHAR(20) CHECK (status IN ('active', 'completed', 'cancelled', 'expired'))
);
-- Audit log for HIPAA compliance
CREATE TABLE access_log (
log_id INTEGER PRIMARY KEY,
user_id VARCHAR(100) NOT NULL,
access_type VARCHAR(50) CHECK (access_type IN ('view', 'create', 'update', 'delete')),
table_name VARCHAR(100) NOT NULL,
record_id INTEGER,
access_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ip_address VARCHAR(45),
reason_for_access VARCHAR(500)
);
-- Create views for common queries
CREATE VIEW patient_summary AS
SELECT
p.patient_id,
p.mrn,
CONCAT(p.last_name, ', ', p.first_name) AS patient_name,
p.date_of_birth,
TIMESTAMPDIFF(YEAR, p.date_of_birth, CURDATE()) AS age,
COUNT(DISTINCT e.encounter_id) AS total_encounters,
MAX(e.encounter_date) AS last_encounter,
COUNT(DISTINCT pr.prescription_id) AS active_prescriptions
FROM patients_demographics p
LEFT JOIN encounters e ON p.patient_id = e.patient_id
LEFT JOIN prescriptions pr ON p.patient_id = pr.patient_id AND pr.status = 'active'
GROUP BY p.patient_id;
Security Implementation:
-- Role-based access control
CREATE ROLE healthcare_provider;
CREATE ROLE billing_staff;
CREATE ROLE admin_staff;
-- Grant minimal necessary privileges
GRANT SELECT ON patients_demographics TO healthcare_provider;
GRANT SELECT, INSERT, UPDATE ON encounters TO healthcare_provider;
GRANT SELECT ON patients_insurance TO billing_staff;
GRANT SELECT, UPDATE ON insurance_claims TO billing_staff;
-- Row-level security (PostgreSQL example)
ALTER TABLE encounters ENABLE ROW LEVEL SECURITY;
CREATE POLICY patient_access_policy ON encounters
USING (patient_id = current_setting('app.current_patient_id')::INT);
-- Encrypt sensitive data
CREATE EXTENSION IF NOT EXISTS pgcrypto;
UPDATE patients_contact
SET address_encrypted = pgp_sym_encrypt('123 Main St', 'encryption-key')
WHERE patient_id = 1;
A transaction is a logical unit of work that accesses and possibly modifies the contents of a database. Transactions transform the database from one consistent state to another.
Formal Definition:
A transaction T is a sequence of read and write operations on database items that must appear as a single, indivisible unit.
Transaction Operations:
- BEGIN TRANSACTION: Marks the start of transaction execution
- READ(X): Reads database item X into program variable
- WRITE(X): Writes program variable value to database item X
- COMMIT: Signals successful end of transaction, making changes permanent
- ROLLBACK: Signals unsuccessful end, undoing any changes made
Example Transaction - Funds Transfer:
BEGIN TRANSACTION;
READ(account_A);
account_A.balance = account_A.balance - 100;
WRITE(account_A);
READ(account_B);
account_B.balance = account_B.balance + 100;
WRITE(account_B);
COMMIT;
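The same transfer can be run end-to-end with Python's sqlite3 module, whose connection context manager commits the whole unit of work on success and rolls it back on any exception (account ids and amounts are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 500), (2, 300)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst atomically; roll back everything on failure."""
    try:
        with conn:  # BEGIN ... COMMIT on success, ROLLBACK on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                         (amount, src))
            (bal,) = conn.execute("SELECT balance FROM accounts WHERE id = ?",
                                  (src,)).fetchone()
            if bal < 0:
                raise ValueError("insufficient funds")  # triggers the rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

print(transfer(conn, 1, 2, 100))   # True: 500/300 becomes 400/400
print(transfer(conn, 1, 2, 1000))  # False: debit is rolled back
print(list(conn.execute("SELECT balance FROM accounts ORDER BY id")))
# [(400,), (400,)]
```

The failed second transfer leaves both balances untouched, which is exactly the atomicity property: the debit that briefly happened inside the transaction never becomes visible.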
ACID properties ensure reliable transaction processing.
Atomicity:
A transaction is an all-or-nothing unit. Either all operations complete successfully, or none take effect.
If the funds transfer transaction fails after debiting account A but before crediting account B, atomicity ensures the debit is undone (rolled back).
Implementation mechanisms:
- Shadow copying: Maintain before and after images
- Transaction log: Record changes for rollback
- Write-ahead logging: Log changes before applying them
Consistency:
A transaction preserves database consistency. If the database is consistent before the transaction, it must be consistent after.
Consistency includes:
- Explicit constraints (primary keys, foreign keys, CHECK constraints)
- Implicit business rules (account balance never negative)
- Application-level consistency (sum of debits equals sum of credits)
Isolation:
Concurrent transactions appear to execute in isolation. Each transaction should see the database as if it were the only active transaction.
Isolation levels (from lowest to highest):
- Read Uncommitted (dirty reads possible)
- Read Committed (no dirty reads)
- Repeatable Read (no non-repeatable reads)
- Serializable (transactions appear sequential)
Durability:
Once a transaction commits, its effects persist even in case of system failure. Committed data survives:
- Power failures
- System crashes
- Media failures (with proper backup/recovery)
Implementation mechanisms:
- Write-ahead logging
- Checkpointing
- Replicated storage
State Transition Diagram:
              +--------+
              | Active |
              +--------+
               /      \
              v        v
     +-----------+   +--------+
     | Partially |-->| Failed |
     | Committed |   +--------+
     +-----------+       |
           |             v
           v        +---------+
     +-----------+  | Aborted |
     | Committed |  +---------+
     +-----------+
State Descriptions:
Active: Initial state; transaction is executing read/write operations.
Partially Committed: After final statement executes but before changes become permanent. Transaction has completed its work but may still be aborted.
Committed: Transaction successfully completed. All changes are permanent and visible to other transactions.
Failed: Transaction cannot proceed normally due to logical error or system issue.
Aborted: Transaction rolled back; database restored to state before transaction began. Can be restarted later.
Example Transaction States:
BEGIN TRANSACTION; -- Enter Active state
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
-- Last statement executed, enter Partially Committed
COMMIT; -- Enter Committed state
-- OR if error occurs
BEGIN TRANSACTION; -- Active
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
-- System crash occurs here -- Enter Failed state
-- After recovery -- Enter Aborted state
A schedule is an ordering of operations from multiple transactions.
Types of Schedules:
Serial Schedule: Transactions execute one after another without interleaving.
T1: READ(A), WRITE(A), READ(B), WRITE(B)
T2: READ(A), WRITE(A), READ(B), WRITE(B)
Serializable Schedule: Interleaved execution that produces same result as some serial schedule.
T1: READ(A) -- T1 reads A
T1: WRITE(A) -- T1 writes A
T2: READ(A) -- T2 reads A after T1's write
T1: READ(B) -- T1 reads B
T2: WRITE(A) -- T2 writes A
T1: WRITE(B) -- T1 writes B
T2: READ(B) -- T2 reads B after T1's write
T2: WRITE(B) -- T2 writes B
Every conflicting pair of operations orders T1 before T2, so this schedule is equivalent to the serial schedule T1, T2.
Non-serializable Schedule: Interleaved execution that yields different result than any serial schedule.
Conflicting Operations:
Two operations conflict if:
- They belong to different transactions
- They access the same data item
- At least one is a write operation
Conflict types:
- Read-Write conflict (R-W): Transaction reads item written by another
- Write-Read conflict (W-R): Transaction writes item read by another
- Write-Write conflict (W-W): Both transactions write same item
Conflict Serializability:
A schedule is conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations.
Precedence Graph Method:
- Create a node for each transaction
- Draw edge Ti → Tj if there's a conflicting operation from Ti before Tj
- Schedule is conflict serializable if graph has no cycles
Example:
Schedule S:
T1: READ(A)
T2: READ(A)
T1: WRITE(B)
T2: WRITE(A)
T1: WRITE(A)
T2: WRITE(B)
Find conflicts:
- T1 READ(A) precedes T2 WRITE(A) (R-W) → T1 → T2
- T2 READ(A) precedes T1 WRITE(A) (R-W) → T2 → T1
- T1 WRITE(B) precedes T2 WRITE(B) (W-W) → T1 → T2
- T2 WRITE(A) precedes T1 WRITE(A) (W-W) → T2 → T1
Graph has cycle T1 → T2 → T1, so the schedule is not conflict serializable.
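The precedence graph method can be automated. A small Python sketch that extracts conflict edges from the schedule above and tests the graph for a cycle:

```python
# Each operation: (transaction, action, item), in schedule order.
schedule = [
    ("T1", "R", "A"), ("T2", "R", "A"),
    ("T1", "W", "B"), ("T2", "W", "A"),
    ("T1", "W", "A"), ("T2", "W", "B"),
]

def precedence_edges(schedule):
    """Edge Ti -> Tj for each conflicting pair: different transactions,
    same item, at least one write, with Ti's operation first."""
    edges = set()
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "W" in (ai, aj):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    """Depth-first search; a back edge to a node on the stack is a cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    state = {n: 0 for n in graph}  # 0 = unvisited, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for m in graph[n]:
            if state[m] == 1 or (state[m] == 0 and dfs(m)):
                return True
        state[n] = 2
        return False
    return any(state[n] == 0 and dfs(n) for n in graph)

edges = precedence_edges(schedule)
print(("T1", "T2") in edges, ("T2", "T1") in edges)  # True True -> cycle
print(has_cycle(edges))  # True -> not conflict serializable
```

The cycle T1 → T2 → T1 confirms the hand analysis: no swap of non-conflicting operations can turn this schedule into a serial one.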
View Serializability:
A schedule is view serializable if:
- For each data item, initial reads match some serial schedule
- For each data item, writes that produce values read match some serial schedule
- Final writes match some serial schedule
View serializability is less restrictive than conflict serializability. Some schedules are view serializable but not conflict serializable.
Example (View Serializable but not Conflict Serializable):
T1: READ(A)
T2: WRITE(A)
T1: WRITE(A)
T3: WRITE(A)
T2 and T3 perform blind writes (writes not preceded by reads). The schedule is not conflict serializable: T1's read before T2's write gives T1 → T2, and T2's write before T1's write gives T2 → T1, a cycle. It is, however, view equivalent to the serial schedule T1, T2, T3: in both, T1 reads the initial value of A and T3 performs the final write.
Locks synchronize access to data items, ensuring serializability.
Lock Types:
Shared Lock (S): Transaction can read but not write. Multiple transactions can hold shared locks simultaneously.
Exclusive Lock (X): Transaction can read and write. Only one transaction can hold exclusive lock.
Lock Compatibility Matrix:
| Requested | Existing S | Existing X |
|---|---|---|
| S | Yes | No |
| X | No | No |
Lock Primitives:
- lock-S(A): Request shared lock on item A
- lock-X(A): Request exclusive lock on item A
- unlock(A): Release lock on item A
Example with Locking:
T1: lock-X(A) -- Get exclusive lock on A
T1: READ(A)
T1: WRITE(A)
T1: unlock(A) -- Release lock
T2: lock-S(A) -- Wait until T1 unlocks
T2: READ(A)
T2: unlock(A)
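The compatibility matrix translates directly into code. A minimal Python check (illustrative; a real lock manager also queues blocked requests and wakes them on unlock):

```python
def compatible(requested, existing):
    """True if a requested lock ('S' or 'X') can be granted given the
    list of locks already held on the item (the compatibility matrix)."""
    if not existing:
        return True                              # unlocked item: grant anything
    if requested == "S":
        return all(l == "S" for l in existing)   # S is compatible only with S
    return False                                 # X conflicts with any held lock

print(compatible("S", ["S", "S"]))  # True  - shared locks coexist
print(compatible("S", ["X"]))       # False - an exclusive lock blocks readers
print(compatible("X", ["S"]))       # False - X needs sole access
print(compatible("X", []))          # True  - the item is unlocked
```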
2PL ensures conflict serializability through a locking protocol with two phases.
Growing Phase: Transaction acquires locks but cannot release them.
Shrinking Phase: Transaction releases locks but cannot acquire new ones.
Strict 2PL: All exclusive locks held until commit/abort. Prevents cascading aborts.
Rigorous 2PL: All locks held until commit/abort. Most common implementation.
Example of 2PL:
T1: lock-X(A) -- Growing phase
T1: lock-S(B) -- Still growing
T1: READ(A)
T1: WRITE(A)
T1: READ(B)
T1: unlock(A) -- Start shrinking (violates strict 2PL)
T1: unlock(B)
Strict 2PL version:
T1: lock-X(A) -- Growing
T1: lock-S(B) -- Growing
T1: READ(A)
T1: WRITE(A)
T1: READ(B)
T1: COMMIT -- Shrinking happens after commit
-- Locks released automatically
Deadlock occurs when transactions wait indefinitely for each other's locks.
Deadlock Example:
T1: lock-X(A) -- T1 locks A
T2: lock-X(B) -- T2 locks B
T1: lock-X(B) -- T1 waits for T2's B
T2: lock-X(A) -- T2 waits for T1's A
-- Deadlock! Both waiting forever
Deadlock Detection:
Wait-for Graph:
- Nodes represent transactions
- Edge Ti → Tj if Ti waiting for lock held by Tj
- Cycle indicates deadlock
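The wait-for graph check can be sketched in Python. This is a simplified model in which each transaction waits for at most one lock at a time, using the deadlock example above:

```python
# Who currently holds each item's lock, and who is waiting for which item.
holds = {"A": "T1", "B": "T2"}
waits = [("T1", "B"), ("T2", "A")]   # (transaction, item it is waiting for)

def wait_for_edges(holds, waits):
    """Edge Ti -> Tj when Ti waits for an item whose lock Tj holds."""
    return {(t, holds[item]) for t, item in waits if item in holds}

def deadlocked(edges):
    """Follow the single outgoing edge from each node; revisiting a node
    on the current path means the wait-for graph has a cycle."""
    graph = dict(edges)  # valid because each transaction waits on one lock
    for start in graph:
        seen, node = set(), start
        while node in graph:
            if node in seen:
                return True
            seen.add(node)
            node = graph[node]
    return False

edges = wait_for_edges(holds, waits)
print(edges == {("T1", "T2"), ("T2", "T1")})  # True: mutual waiting
print(deadlocked(edges))                      # True: cycle found
```

When a cycle is found, the DBMS breaks it by aborting one victim transaction, typically the youngest or the one with the least work done.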
Detection Algorithm:
CREATE FUNCTION detect_deadlock() RETURNS boolean AS $$
DECLARE
cycle_found boolean := false;
BEGIN
-- Build wait-for graph and check for cycles
-- If cycle found, abort youngest transaction
RETURN cycle_found;
END;
$$ LANGUAGE plpgsql;
Deadlock Prevention:
Wait-Die Scheme (non-preemptive):
- An older transaction waits for a younger one that holds the lock
- A younger transaction aborts (dies) if it requests a lock held by an older one
Wound-Wait Scheme (preemptive):
- An older transaction preempts (wounds) a younger lock holder, forcing it to abort
- A younger transaction waits if it requests a lock held by an older one
Timeout-Based Prevention:
- Transaction aborts if wait exceeds timeout
- Simple but may abort unnecessarily
Example - Wound-Wait:
T1 (older) holds lock on A
T2 (younger) requests lock on A
Result: T2 aborts (wounded)
T2 (younger) holds lock on B
T1 (older) requests lock on B
Result: T2 aborts, T1 gets lock
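Both schemes reduce to a single timestamp comparison between the requester and the current lock holder (a smaller timestamp means an older transaction). A Python sketch reproducing the wound-wait example above:

```python
def wait_die(requester_ts, holder_ts):
    """Wait-Die: older requesters wait; younger requesters die (abort)."""
    return "wait" if requester_ts < holder_ts else "abort"

def wound_wait(requester_ts, holder_ts):
    """Wound-Wait: older requesters wound (abort) the holder; younger wait."""
    return "wound holder" if requester_ts < holder_ts else "wait"

# T1 is older (ts=100) than T2 (ts=200).
print(wait_die(200, 100))    # 'abort'        - younger T2 dies
print(wait_die(100, 200))    # 'wait'         - older T1 waits
print(wound_wait(100, 200))  # 'wound holder' - T1 preempts T2's lock
print(wound_wait(200, 100))  # 'wait'         - younger T2 waits
```

Because an aborted transaction restarts with its original timestamp, it eventually becomes the oldest and cannot be aborted again, so neither scheme starves a transaction.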
Each transaction gets a unique timestamp (usually derived from the system clock). The schedule must be equivalent to the timestamp order.
Basic Timestamp Ordering:
For each data item X, maintain:
- W-timestamp(X): Largest timestamp of transaction that wrote X
- R-timestamp(X): Largest timestamp of transaction that read X
Rules:
Read Operation:
If TS(T) < W-timestamp(X), reject (read arrives too late).
Otherwise allow the read and set R-timestamp(X) = max(R-timestamp(X), TS(T)).
Write Operation:
If TS(T) < R-timestamp(X) or TS(T) < W-timestamp(X), reject (write arrives too late).
Otherwise allow the write and set W-timestamp(X) = TS(T).
Example:
Initial: W-timestamp(A)=0, R-timestamp(A)=0
T1 (TS=100): READ(A) -- TS=100 > W-timestamp=0 → allowed
-- R-timestamp(A)=100
T2 (TS=200): WRITE(A) -- TS=200 > R-timestamp=100 → allowed
-- W-timestamp(A)=200
T3 (TS=150): WRITE(A) -- TS=150 < R-timestamp=100? No
-- TS=150 < W-timestamp=200? Yes → reject
Thomas Write Rule:
If TS(T) < W-timestamp(X), ignore write (it's obsolete) instead of rejecting.
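The read/write rules, including the Thomas write rule, can be simulated in a few lines of Python, reproducing the example above:

```python
class Item:
    """A data item carrying the two timestamps the protocol maintains."""
    def __init__(self):
        self.r_ts = 0   # largest timestamp of a transaction that read it
        self.w_ts = 0   # largest timestamp of a transaction that wrote it

def read(item, ts):
    if ts < item.w_ts:
        return "reject"              # a later transaction already wrote it
    item.r_ts = max(item.r_ts, ts)
    return "ok"

def write(item, ts, thomas=False):
    if ts < item.r_ts:
        return "reject"              # a later transaction already read it
    if ts < item.w_ts:
        return "ignore" if thomas else "reject"  # obsolete write
    item.w_ts = ts
    return "ok"

a = Item()
print(read(a, 100))                 # 'ok'     -> R-timestamp(A)=100
print(write(a, 200))                # 'ok'     -> W-timestamp(A)=200
print(write(a, 150))                # 'reject' -> T3 writes too late
print(write(a, 150, thomas=True))   # 'ignore' under the Thomas write rule
```

Under the Thomas write rule the obsolete write is simply skipped rather than aborting T3, because the value it would have written could never be read by anyone.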
MVCC maintains multiple versions of data items, allowing readers to see consistent snapshot without blocking writers.
How MVCC Works:
- Each write creates new version of data item
- Each version has timestamp (or transaction ID)
- Readers see snapshot of database at their start time
- Writers don't block readers; readers don't block writers
Version Management:
Each data item version contains:
- Value
- Create timestamp (transaction that created it)
- Delete timestamp (transaction that deleted it, if any)
Snapshot Isolation:
Transaction sees all versions committed before its start timestamp. No dirty reads, no non-repeatable reads, but may have write skew anomalies.
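The visibility rule behind snapshot isolation can be sketched in Python. This is a simplified model (versions carry the creating and deleting transaction ids, and a snapshot sees only versions from transactions that committed before it started; real systems also track in-progress transactions):

```python
# Each version: (value, xmin, xmax) - ids of the creating and deleting
# transactions; xmax of None means the version has not been superseded.
versions = [
    (500, 90, 100),    # created by txn 90, superseded by txn 100
    (400, 100, None),  # current version, created by txn 100
]

def visible(version, snapshot_xid, committed):
    """A version is visible if its creator committed before the snapshot
    and its deleter (if any) had not yet committed."""
    value, xmin, xmax = version
    created = xmin in committed and xmin < snapshot_xid
    deleted = xmax is not None and xmax in committed and xmax < snapshot_xid
    return created and not deleted

def read_snapshot(versions, snapshot_xid, committed):
    for v in reversed(versions):       # newest version first
        if visible(v, snapshot_xid, committed):
            return v[0]
    return None

# Txn 101 took its snapshot before txn 100 committed: it sees the old value.
print(read_snapshot(versions, 101, committed={90}))       # 500
# A snapshot taken after txn 100 committed sees the update.
print(read_snapshot(versions, 102, committed={90, 100}))  # 400
```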
Example - PostgreSQL MVCC:
-- Transaction 1 (XID=100)
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Creates new version of row with xmin=100
-- Transaction 2 (XID=101) starts after T1
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
-- Sees old version (xmin < 101 and not deleted)
-- T1 commits
COMMIT;
-- T2 continues
SELECT balance FROM accounts WHERE id = 1;
-- Still sees old version (snapshot at T2 start)
COMMIT;
Optimistic protocols assume conflicts are rare. Transactions proceed without locking and validate before commit.
Three Phases:
Read Phase:
- Transaction reads data, makes changes to private workspace
- No locking, no waiting
Validation Phase:
- Check if transaction conflicts with others
- Ensure serializability
Write Phase:
- If validation succeeds, make changes permanent
- If validation fails, abort and restart
Validation Techniques:
Timestamp-Based Validation:
Assign each transaction a timestamp at start (S) and at validation (V).
Check: for every conflicting transaction Tj with V(Tj) < V(Ti), one of the following must hold:
- Tj finishes its write phase before Ti starts
- Tj's write set does not intersect Ti's read set
Example:
T1: READ(A), READ(B) -- Private workspace
T2: READ(B), WRITE(B) -- Private workspace
Validation:
- T1's read set = {A,B}
- T2's write set = {B}
- Conflict on B → T2 must validate first or abort
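The read-set/write-set intersection test can be sketched in Python (backward validation against already-validated transactions; a real implementation also checks the phase-ordering conditions above):

```python
def validate(committing, validated_others):
    """A committing transaction passes validation only if no
    already-validated transaction wrote an item it read."""
    for other in validated_others:
        if other["write_set"] & committing["read_set"]:
            return False   # stale read detected: abort and restart
    return True

t1 = {"read_set": {"A", "B"}, "write_set": set()}
t2 = {"read_set": {"B"}, "write_set": {"B"}}

# If T2 validated first, T1's read of B may be stale: T1 must abort.
print(validate(t1, validated_others=[t2]))  # False
# If T1 reaches validation first, nothing conflicts: T1 passes.
print(validate(t1, validated_others=[]))    # True
```

This is why optimistic concurrency performs well under read-heavy workloads: validation only fails when write sets actually overlap with concurrent read sets.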
SQL defines four isolation levels with increasing consistency.
Read Uncommitted:
- Dirty reads possible
- No locks (or very short locks)
- Highest performance, lowest consistency
Read Committed:
- No dirty reads
- Each query sees only committed data
- Non-repeatable reads possible
- Default in Oracle, PostgreSQL
Repeatable Read:
- No dirty reads, no non-repeatable reads
- Phantom reads possible
- Default in MySQL
Serializable:
- Highest isolation level
- Transactions appear serial
- No anomalies possible
Anomalies:
Dirty Read: Reading uncommitted data that may be rolled back.
Non-repeatable Read: Same query returns different values within transaction.
Phantom Read: New rows appear/disappear in subsequent queries.
Isolation Level Comparison:
| Level | Dirty Read | Non-repeatable | Phantom |
|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible |
| Read Committed | Not Possible | Possible | Possible |
| Repeatable Read | Not Possible | Not Possible | Possible |
| Serializable | Not Possible | Not Possible | Not Possible |
Setting Isolation Levels:
-- SQL Server
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
-- queries
COMMIT;
-- PostgreSQL
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- queries
COMMIT;
-- MySQL
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
-- queries
COMMIT;
Implementation Differences:
Read Committed in PostgreSQL:
- Uses MVCC
- Each statement gets new snapshot
- Ensures statement-level consistency
Serializable in PostgreSQL:
- Uses Serializable Snapshot Isolation (SSI)
- Detects serialization anomalies
- May abort transactions with conflicts
Database systems must handle various failure types.
Transaction Failures:
- Logical errors (division by zero, constraint violation)
- System errors (deadlock, resource exhaustion)
- User actions (explicit ROLLBACK)
System Crashes:
- Power failure
- Operating system crash
- DBMS software failure
- Hardware failure (CPU, memory)
Media Failures:
- Disk head crash
- Disk controller failure
- Bad sectors
- Accidental file deletion
Natural Disasters:
- Fire, flood, earthquake
- Theft, vandalism
- Power grid failure
Write-ahead logging (WAL) ensures durability and supports recovery.
Log Structure:
Each log record contains:
- LSN: Log Sequence Number (increasing)
- TransID: Transaction identifier
- PageID: Modified page identifier
- Old value: Before image (for undo)
- New value: After image (for redo)
- PrevLSN: Previous LSN for same transaction (linked list)
Log Record Types:
- UPDATE: Record of data modification
- COMMIT: Transaction committed
- ABORT: Transaction aborted
- CHECKPOINT: Checkpoint record
- CLR: Compensation Log Record (for undo)
Write-Ahead Logging Protocol:
- Log record must be written to stable storage before corresponding data page
- All log records for transaction must be written before COMMIT record
- Data pages can be written anytime after log records are stable
Example Log Sequence:
LSN 100: T1 BEGIN
LSN 101: T1 UPDATE (A, 100, 200) -- A changed from 100 to 200
LSN 102: T1 UPDATE (B, 50, 150) -- B changed from 50 to 150
LSN 103: T1 COMMIT
LSN 104: T2 BEGIN
LSN 105: T2 UPDATE (C, 75, 125)
-- Crash occurs here
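Recovery from this crash follows directly from the WAL rules: T1 has a COMMIT record, so its updates are redone; T2 does not, so its update is undone using the before image. A minimal sketch of that decision, using the example log (the assumption that C's page was flushed before the crash is illustrative):

```python
# (LSN, transaction, record type, (item, old value, new value))
log = [
    (100, "T1", "BEGIN", None),
    (101, "T1", "UPDATE", ("A", 100, 200)),
    (102, "T1", "UPDATE", ("B", 50, 150)),
    (103, "T1", "COMMIT", None),
    (104, "T2", "BEGIN", None),
    (105, "T2", "UPDATE", ("C", 75, 125)),
]  # crash occurs here

committed = {t for _, t, kind, _ in log if kind == "COMMIT"}
db = {"A": 100, "B": 50, "C": 125}   # assume C's page was flushed pre-crash

for _, t, kind, body in log:                   # redo pass (forward scan)
    if kind == "UPDATE" and t in committed:
        item, _, new = body
        db[item] = new
for _, t, kind, body in reversed(log):         # undo pass (backward scan)
    if kind == "UPDATE" and t not in committed:
        item, old, _ = body
        db[item] = old

print(db)  # {'A': 200, 'B': 150, 'C': 75}
```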
Checkpoints reduce recovery time by establishing known-good database states.
Simple Checkpoint:
- Suspend all transactions
- Write all dirty (modified) buffers to disk
- Write checkpoint record to log
- Resume transactions
Problem: Suspending transactions hurts performance.
Fuzzy Checkpoint:
- Write checkpoint record to log (includes list of active transactions)
- Continue transactions normally
- Background process writes dirty buffers
- No suspension needed
Checkpoint Record Contents:
- List of active transactions
- Their most recent LSNs
- Dirty buffer information
Recovery with Checkpoint:
During recovery, only the log written after the last checkpoint needs to be scanned, plus earlier records of any transactions that were still active at the checkpoint.
Alternative to log-based recovery using page table indirection.
How Shadow Paging Works:
1. Maintain two page tables:
   - Current page table: Points to current database pages
   - Shadow page table: Points to original pages (before transaction)
2. During transaction:
   - All changes go to new pages (not overwriting originals)
   - Current page table updated to point to new pages
   - Shadow page table unchanged
3. On commit:
   - Make shadow page table point to new pages
   - Discard original pages
4. On abort:
   - Discard new pages
   - Keep using shadow page table
Advantages:
- No logging overhead
- Fast recovery (no undo/redo)
- Simple implementation
Disadvantages:
- Data fragmentation
- Garbage collection needed
- Poor for large transactions
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the most widely used recovery method.
Key Principles:
- Write-ahead logging: Log before writing data
- Repeating history during redo: Redo all operations from log
- Logging during undo: Log undo actions (CLRs) to avoid redoing undone work
ARIES Data Structures:
- Log: Sequential log records with LSNs
- Page LSN: LSN of last log record affecting page
- Dirty Page Table (DPT): Pages modified since last checkpoint
- Transaction Table: Active transactions and their last LSN
ARIES Recovery Phases:
Phase 1 - Analysis:
- Start from last checkpoint
- Scan log forward
- Build Transaction Table and Dirty Page Table
- Determine starting point for redo
Phase 2 - Redo:
- Start from smallest LSN in Dirty Page Table
- Scan log forward
- Reapply all updates (even for aborted transactions)
- Ensures database state matches logged history
Phase 3 - Undo:
- Scan backward from end
- Undo all transactions active at crash
- Write CLRs for each undo
- Continue until all aborted transactions undone
ARIES Example:
Checkpoint at LSN 100:
- Active transactions: T1 (last LSN=95)
- Dirty pages: P1 (LSN=95)
Log after checkpoint:
LSN 101: T1 UPDATE P2 -- T1 continues
LSN 102: T2 BEGIN
LSN 103: T2 UPDATE P1
LSN 104: T1 UPDATE P3
LSN 105: T2 UPDATE P2
-- Crash occurs
Recovery:
Analysis Phase:
- Transaction Table: T1(last=104), T2(last=105)
- Dirty Page Table: P1(recLSN=95, carried over from the checkpoint), P2(recLSN=101), P3(recLSN=104)
- Redo start = 95 (smallest recLSN in the Dirty Page Table)
Redo Phase:
- Scan forward from LSN 95, reapplying LSN 95 and 101-105 (all logged updates)
Undo Phase:
- Undo the loser transactions' updates in reverse LSN order across both transactions
- Undo LSN 105 (T2), write CLR 106
- Undo LSN 104 (T1), write CLR 107
- Undo LSN 103 (T2), write CLR 108
- Undo LSN 101 (T1), write CLR 109
- Undo LSN 95 (T1, from before the checkpoint), write CLR 110
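The analysis phase of this example can be traced in a few lines. A minimal sketch (the tuple-based log format is an assumption; note that P1's recLSN comes from the checkpoint's Dirty Page Table, which is why redo starts at 95):

```python
checkpoint = {"active": {"T1": 95}, "dirty": {"P1": 95}}
log = [(101, "T1", "P2"), (102, "T2", None), (103, "T2", "P1"),
       (104, "T1", "P3"), (105, "T2", "P2")]   # (LSN, txn, page updated)

txn_table = dict(checkpoint["active"])   # txn -> last LSN
dpt = dict(checkpoint["dirty"])          # page -> recLSN (first dirtying LSN)
for lsn, txn, page in log:               # forward scan from the checkpoint
    txn_table[txn] = lsn
    if page is not None and page not in dpt:
        dpt[page] = lsn                  # keep the earliest recLSN per page

redo_start = min(dpt.values())
print(txn_table)   # {'T1': 104, 'T2': 105}
print(dpt)         # {'P1': 95, 'P2': 101, 'P3': 104}
print(redo_start)  # 95
```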
Backup Types:
Full Backup:
- Complete copy of entire database
- Time-consuming, large storage
- Baseline for recovery
Incremental Backup:
- Only changes since last backup
- Faster, smaller
- Requires full backup for restore
Differential Backup:
- All changes since last full backup
- Larger than incremental, faster restore
Log Backup:
- Transaction log files
- Enables point-in-time recovery
- Continuous or scheduled
Backup Strategies:
Strategy 1: Full + Differential
Sunday: Full backup
Monday: Differential (changes since Sunday)
Tuesday: Differential (changes since Sunday)
Wednesday: Differential
Thursday: Differential
Friday: Differential
Saturday: Differential
Restore: Full + most recent differential
Strategy 2: Full + Incremental
Sunday: Full backup
Monday: Incremental (Monday's changes)
Tuesday: Incremental (Tuesday's changes)
Wednesday: Incremental
Thursday: Incremental
Friday: Incremental
Saturday: Incremental
Restore: Full + all incrementals in order
Strategy 3: Full + Logs
Sunday 00:00: Full backup
Continuous: Transaction log backups every 15 minutes
Restore: Full + all log backups up to desired time
Point-in-Time Recovery:
-- PostgreSQL
pg_basebackup -D /backup/full
-- Then restore with recovery target time
recovery_target_time = '2024-01-15 14:30:00'
-- SQL Server
RESTORE DATABASE db FROM full_backup
WITH NORECOVERY;
RESTORE LOG db FROM log_backup
WITH RECOVERY, STOPAT = '2024-01-15 14:30:00';
Backup Commands:
-- MySQL
mysqldump --all-databases --single-transaction > full_backup.sql
mysql < full_backup.sql
-- PostgreSQL
pg_dump dbname > dbname.sql
psql dbname < dbname.sql
-- Oracle
RMAN> BACKUP DATABASE PLUS ARCHIVELOG;
RMAN> RESTORE DATABASE;
RMAN> RECOVER DATABASE;
Restore Testing:
Critical practice: Regularly test backups by restoring to test environment.
-- Create test database
CREATE DATABASE restore_test;
-- Restore backup to test
-- Run validation queries
-- Compare with production
DROP DATABASE restore_test;
Database files organize data on disk for efficient access.
Storage Hierarchy:
- Primary Storage (Cache): CPU registers, L1/L2 cache (nanoseconds)
- Main Memory: RAM (microseconds)
- Flash Storage: SSD (microseconds to milliseconds)
- Magnetic Storage: HDD (milliseconds)
- Optical/Tape: Archive storage (seconds to minutes)
Database File Structure:
- Data files: Store actual table data
- Index files: Store index structures
- Control files: Database configuration and state
- Log files: Transaction logs for recovery
- Temp files: Sort/hash temporary storage
Pages/Blocks:
Database manages storage in fixed-size units (typically 4KB-64KB):
- Each page contains multiple records
- Page header with metadata
- Records stored within page
- Free space tracking
Heap files store records without any particular order.
Structure:
- New records appended at end
- Deleted records create gaps
- Free space management tracks available slots
Advantages:
- Fast inserts (append)
- Simple implementation
- Good for full table scans
Disadvantages:
- Slow searches (need full scan)
- Space fragmentation
- Inefficient for ordered access
Heap File Organization:
Page 0: [Header][Record1][Record2][Free Space]
Page 1: [Header][Record3][Free Space][Record4]
Page 2: [Header][Record5][Record6][Record7]
Free Space Management:
Methods to track free space:
- Linked list: Pages with free space linked together
- Bitmap: Bit per page indicating free space availability
- Space map: Separate structure tracking page free space
Sequential files maintain records in sorted order by search key.
Structure:
- Records ordered by key value
- Overflow pages for inserts
- Periodic reorganization
Advantages:
- Fast sequential access in key order
- Efficient range queries
- Binary search possible
Disadvantages:
- Slow inserts (need to maintain order)
- Overflow chains degrade performance
- Periodic reorganization needed
Maintaining Order:
Insert record with key=45 into sorted file:
Original: [10][20][30][40][50][60]
Insert 45:
- Find location (between 40 and 50)
- If space in page, insert
- If no space, create overflow page
- Link overflow to main page
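The insert logic above can be sketched with a page capacity limit and an overflow list (the capacity of 6 keys is an arbitrary assumption for illustration):

```python
import bisect

PAGE_CAPACITY = 6
page, overflow = [10, 20, 30, 40, 50, 60], []

def insert_sorted(key):
    if len(page) < PAGE_CAPACITY:
        bisect.insort(page, key)      # room available: keep the page sorted
    else:
        bisect.insort(overflow, key)  # page full: chain into an overflow page

insert_sorted(45)
print(page)      # [10, 20, 30, 40, 50, 60]  (unchanged: it was full)
print(overflow)  # [45]
```

As overflow chains grow, lookups degrade, which is why sequential files need periodic reorganization to merge overflow pages back into the main file.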
Hash files distribute records across buckets using hash function.
Static Hashing:
- Fixed number of buckets
- Hash function maps key to bucket
- Simple but doesn't adapt to growth
Extendible Hashing:
- Dynamic bucket growth
- Directory of pointers to buckets
- Splits buckets when full
- Handles growing data gracefully
Linear Hashing:
- Gradual bucket growth
- No directory needed
- Round-robin splitting
- Good for concurrent access
Hash Function Properties:
- Deterministic: Same key always same hash
- Uniform: Keys evenly distributed
- Fast: Minimal computation overhead
Example - Extendible Hashing:
Global depth = 2
Directory: [00→A, 01→B, 10→C, 11→D]
Bucket A (depth=2): keys with hash 00: [k1, k4]
Bucket B (depth=2): keys with hash 01: [k2, k5]
Bucket C (depth=2): keys with hash 10: [k3]
Bucket D (depth=2): keys with hash 11: [k6, k7, k8]
Insert k9 with hash 10:
- Bucket C overflows
- Split bucket C into C (100) and E (101)
- Global depth becomes 3
- Directory doubles
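A bucket split like the one above redistributes keys by examining one more hash bit. A simplified sketch, using low-order bits to select buckets (a common convention) and treating small integers as their own hash values; the bucket capacity of 2 is an assumption:

```python
def split_bucket(keys, local_depth):
    """Split a full bucket by examining one more low-order hash bit."""
    halves = {0: [], 1: []}
    for k in keys:
        halves[(k >> local_depth) & 1].append(k)   # next bit picks the half
    return local_depth + 1, halves[0], halves[1]

# Bucket "10" (local depth 2) holds keys 2 and 6; inserting 10 overflows it.
depth, bucket_010, bucket_110 = split_bucket([2, 6, 10], 2)
print(depth, bucket_010, bucket_110)  # 3 [2, 10] [6]
# Since local depth (3) now exceeds the old global depth (2), the directory
# doubles from 4 to 8 entries; "010" entries point to the first new bucket,
# "110" entries to the second.
```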
RAID (Redundant Array of Independent Disks) combines multiple disks for performance and reliability.
RAID Levels:
RAID 0 (Striping):
- Data striped across disks
- No redundancy
- Best performance, no fault tolerance
RAID 1 (Mirroring):
- Data duplicated on two disks
- Excellent fault tolerance
- 50% storage efficiency
RAID 5 (Striping with Parity):
- Data striped with distributed parity
- Can survive one disk failure
- Good read performance
- Write penalty (must update parity)
RAID 6 (Striping with Double Parity):
- Two parity blocks per stripe
- Survives two disk failures
- Higher write penalty
RAID 10 (Striped Mirrors):
- Combination of RAID 1 and 0
- Excellent performance and fault tolerance
- 50% storage efficiency
- Most common for high-performance databases
RAID Comparison:
| Level | Min Disks | Read | Write | Fault Tolerance | Efficiency |
|---|---|---|---|---|---|
| RAID 0 | 2 | Excellent | Excellent | None | 100% |
| RAID 1 | 2 | Good | Good | 1 disk | 50% |
| RAID 5 | 3 | Good | Fair | 1 disk | 67-94% |
| RAID 6 | 4 | Good | Poor | 2 disks | 50-88% |
| RAID 10 | 4 | Excellent | Excellent | 1 per mirror | 50% |
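RAID 5's fault tolerance rests on XOR parity: the parity block is the XOR of the data blocks in a stripe, so any single lost block is the XOR of the survivors. A minimal sketch with 4-bit toy blocks:

```python
d1, d2, d3 = 0b1010, 0b0110, 0b1100   # data blocks on three disks
parity = d1 ^ d2 ^ d3                  # parity block on a fourth disk

# Disk 2 fails: rebuild its block from the surviving disks plus parity
recovered_d2 = d1 ^ d3 ^ parity
print(recovered_d2 == d2)  # True
```

The write penalty follows from the same identity: updating one data block requires reading the old data and old parity, recomputing parity, and writing both back.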
Ordered indices maintain keys in sorted order for efficient search.
Primary Index:
- Defined on sorted file's ordering key
- One index entry per data block
- Sparse index
Clustering Index:
- Defined on non-key field with sorted data
- One index entry per distinct value or block
- Data physically ordered by index key
Secondary Index:
- Defined on any field
- One index entry per record
- Dense index
- Data not physically ordered
Index Structure:
Primary Index (sparse):
Key | Block Pointer
10 | Block 0
20 | Block 1
30 | Block 2
Data Blocks:
Block 0: [10][12][15][18]
Block 1: [20][22][25][28]
Block 2: [30][32][35][38]
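A lookup through the sparse index above binary-searches the index entries, then scans the single chosen block. A minimal sketch mirroring that figure:

```python
import bisect

index_keys = [10, 20, 30]   # first key of each data block (sparse entries)
blocks = [[10, 12, 15, 18], [20, 22, 25, 28], [30, 32, 35, 38]]

def lookup(key):
    i = bisect.bisect_right(index_keys, key) - 1   # last index entry <= key
    return key in blocks[i] if i >= 0 else False   # scan only that block

print(lookup(25))  # True  (found inside block 1)
print(lookup(26))  # False
```

Because the index holds one entry per block rather than per record, it stays small enough to cache in memory even for large files.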
B-Tree is a balanced tree structure maintaining sorted data.
Properties:
- Balanced tree (all leaves at same level)
- Each node contains multiple keys
- Nodes have pointers to children
- Keys in node are sorted
Node Structure:
- Contains between ceil(n/2) and n keys, where n is the maximum number of keys per node
- Internal nodes: keys + child pointers
- Leaf nodes: keys + data pointers (or data)
B-Tree of Order m:
- Root can have 2 to m children
- Internal nodes have ceil(m/2) to m children
- Leaves have ceil(m/2) to m keys
Example B-Tree (order 5):
[30, 50]
/ | \
/ | \
[10,20] [40,45] [60,70,80]
B+ Tree is the most common index structure in databases. It extends B-Tree with data only in leaves.
Differences from B-Tree:
- All data stored in leaves
- Internal nodes store only keys (routing values)
- Leaves linked for sequential access
- More keys per node (smaller tree)
B+ Tree Structure:
Internal Nodes (routing only):
[50, 100]
/ | \
/ | \
[25,40] [75,90] [120,150]
| | |
Leaf Nodes (data):
[10,15,20]→[25,30,35]→[40,45]→[50,55,60]→[75,80,85]→...
Advantages:
- Balanced (logarithmic search)
- Efficient range queries (linked leaves)
- High fanout (fewer levels)
- Good cache performance
Operations:
Search:
function search(key):
node = root
while node is not leaf:
find child pointer for key
node = child
search leaf for key
return data pointer or null
Insert:
function insert(key, data):
find leaf for key
if leaf has space:
insert in leaf
else:
split leaf into two
distribute keys evenly
promote middle key to parent
if parent overflows, split recursively
Delete:
function delete(key):
find and remove key from leaf
if leaf underflows:
try to borrow from sibling
if cannot borrow, merge with sibling
update parent keys
if parent underflows, propagate
Hash indexes use hash functions for direct access.
Static Hash Index:
- Fixed number of buckets
- Hash function maps key to bucket
- Buckets may have overflow chains
Extendible Hash Index:
- Dynamic bucket growth
- Directory of bucket pointers
- Splits buckets when full
Linear Hash Index:
- Gradual bucket addition
- No directory overhead
- Round-robin splitting
Hash Index Example:
Hash function: h(key) = key mod 4
Bucket 0: [4, 8, 12, 16] (overflow chain for 16)
Bucket 1: [1, 5, 9, 13]
Bucket 2: [2, 6, 10, 14]
Bucket 3: [3, 7, 11, 15]
Bitmap indexes use bit arrays for low-cardinality columns.
Structure:
- One bitmap per distinct value
- Each bit represents a row
- Bit set if row has that value
Example - Gender Column:
Rows: 1:M, 2:F, 3:M, 4:F, 5:M
Gender='M' bitmap: [1,0,1,0,1]
Gender='F' bitmap: [0,1,0,1,0]
Bitmap Operations:
- AND: Find rows with multiple conditions
- OR: Find rows with any condition
- NOT: Find rows without value
- COUNT: Fast population count
Query Example:
SELECT * FROM employees
WHERE gender = 'M' AND department = 'IT'
-- Use bitmaps:
-- Gender_M bitmap: [1,0,1,0,1]
-- Dept_IT bitmap: [1,1,0,0,1]
-- AND result: [1,0,0,0,1]
-- Rows 1 and 5 match
Advantages:
- Fast for low-cardinality columns
- Efficient storage (bit compression)
- Fast boolean operations
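The query above can be evaluated with pure bit operations. A sketch holding each bitmap as a Python integer, where bit i set means row i+1 matches:

```python
gender_m = 0b10101   # rows 1, 3, 5 have gender = 'M'
dept_it  = 0b10011   # rows 1, 2, 5 are in department = 'IT'

both = gender_m & dept_it                          # AND of the two bitmaps
matches = [i + 1 for i in range(5) if both >> i & 1]
print(matches)                # [1, 5]
print(bin(both).count("1"))   # 2 -- COUNT(*) via population count
```

This is why bitmap indexes excel at multi-predicate analytics: combining conditions costs one machine word operation per 64 rows rather than per-row evaluation.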
R-Tree indexes multidimensional data (spatial, geometric).
Structure:
- Hierarchical organization of bounding boxes
- Each node has minimum bounding rectangle (MBR)
- Leaf nodes contain actual objects
- Internal nodes contain MBRs of children
R-Tree Properties:
- Balanced tree
- Overlapping regions allowed
- Minimizes area, overlap, margin
- Supports spatial queries
Spatial Queries:
Point Query: Find all objects containing point P
Range Query: Find all objects within rectangle R
Nearest Neighbor: Find k closest objects to point P
R-Tree Example:
Root MBR: covers entire space
├── Node1 MBR: covers Europe
│ ├── Leaf: Paris (point)
│ ├── Leaf: London (point)
│ └── Leaf: Berlin (point)
└── Node2 MBR: covers Asia
├── Leaf: Tokyo (point)
├── Leaf: Beijing (point)
└── Leaf: Seoul (point)
Choosing appropriate indexes is crucial for performance.
Factors to Consider:
Query Patterns:
- Which columns in WHERE clauses?
- Which columns in JOIN conditions?
- Which columns in ORDER BY?
- Which columns in GROUP BY?
Data Characteristics:
- Cardinality (number of distinct values)
- Data distribution (skew)
- Update frequency
- Table size
Index Types by Use Case:
Primary Key/Unique Constraint:
-- Automatic unique index
CREATE TABLE users (
user_id INT PRIMARY KEY,
email VARCHAR(200) UNIQUE
);
Foreign Keys:
-- Index foreign key columns
CREATE INDEX idx_orders_customer ON orders(customer_id);
Search Conditions:
-- Single column for equality
CREATE INDEX idx_users_email ON users(email);
-- Composite for multiple conditions
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date);
Range Queries:
-- Index on range column
CREATE INDEX idx_products_price ON products(price);
-- Composite with equality first, then range
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date);
Sorting/Ordering:
-- Index supports ORDER BY without sort
CREATE INDEX idx_users_registration ON users(registration_date DESC);
Covering Indexes:
-- Index includes all needed columns
CREATE INDEX idx_users_email_name
ON users(email) INCLUDE (first_name, last_name);
Partial Indexes:
-- Index only active users
CREATE INDEX idx_active_users_email
ON users(email) WHERE status = 'active';
Index Selection Guidelines:
1. Start with obvious indexes: PK, FK, frequent query columns
2. Analyze actual queries: Use query logs to identify patterns
3. Consider selectivity: High-cardinality columns benefit most
4. Balance reads vs writes: Each index adds write overhead
5. Monitor and adjust: Use database statistics and query plans
6. Avoid over-indexing: Too many indexes hurt INSERT/UPDATE performance
7. Test with realistic data: Development data may not reflect production
Index Maintenance:
-- Rebuild fragmented indexes
ALTER INDEX idx_name REBUILD;
-- Update statistics
ANALYZE TABLE table_name;
-- Monitor index usage (PostgreSQL)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan;
Parsing converts SQL text into internal representation.
Parsing Phases:
Lexical Analysis:
- Break SQL into tokens (keywords, identifiers, literals)
- Identify language elements
- Report lexical errors
Syntax Analysis:
- Check grammar rules
- Build parse tree
- Validate SQL structure
- Report syntax errors
Semantic Analysis:
- Verify objects exist
- Check column references
- Validate data types
- Resolve views and synonyms
- Check permissions
Parse Tree Example:
SELECT s.name, c.title
FROM students s
JOIN enrollments e ON s.id = e.student_id
JOIN courses c ON e.course_id = c.id
WHERE e.grade = 'A'
Parse tree representation:
Query
├── SELECT
│ ├── Column: s.name
│ └── Column: c.title
├── FROM
│ ├── Table: students AS s
│ ├── JOIN: enrollments AS e ON s.id = e.student_id
│ └── JOIN: courses AS c ON e.course_id = c.id
└── WHERE
└── Condition: e.grade = 'A'
Translation converts parse tree to relational algebra.
Translation Rules:
SELECT:
SELECT * FROM students WHERE major = 'CS'
→ σ_major='CS'(students)
JOIN:
SELECT * FROM students s, enrollments e
WHERE s.id = e.student_id
→ s ⋈_{s.id=e.student_id} e
GROUP BY:
SELECT major, COUNT(*) FROM students GROUP BY major
→ γ_major, COUNT(*)(students)
Combined:
SELECT s.name, c.title
FROM students s, enrollments e, courses c
WHERE s.id = e.student_id AND e.course_id = c.id AND e.grade = 'A'
→ π_name,title(σ_grade='A'(s ⋈ e ⋈ c))
Logical plan represents query operations without physical details.
Logical Operators:
- Scan: Read table (full or index)
- Select: Filter rows (σ)
- Project: Choose columns (π)
- Join: Combine tables (⋈)
- Aggregate: Group and aggregate (γ)
- Sort: Order rows (τ)
- Distinct: Remove duplicates (δ)
Logical Plan Example:
Query:
SELECT department, AVG(salary)
FROM employees
WHERE hire_date > '2020-01-01'
GROUP BY department
HAVING COUNT(*) > 5
Logical plan:
π_department, avg_salary
│
γ_department, AVG(salary)→avg_salary, COUNT(*)→cnt
│
σ_hire_date > '2020-01-01'
│
employees
Physical plan adds implementation details to logical plan.
Physical Operators:
- Table Scan: Sequential read of entire table
- Index Scan: Read using index
- Index Seek: Direct lookup via index
- Nested Loops Join: For each outer row, scan inner
- Hash Join: Build hash table on one input
- Merge Join: Sort both inputs then merge
- Sort: Explicit sorting (may use index)
- Aggregate: Hash or sort-based grouping
Physical Plan Example:
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.state = 'CA'
Physical plan options:
Option 1 (Nested Loops):
Nested Loops Join
├── Index Seek on customers (state='CA')
└── Index Seek on orders (customer_id=@id)
Option 2 (Hash Join):
Hash Join
├── Index Seek on customers (state='CA')
└── Table Scan on orders
Option 3 (Merge Join):
Merge Join
├── Sort (customers filtered by state='CA')
└── Sort (orders)
Nested Loop Join:
for each row r in outer table:
for each row s in inner table:
if r.key = s.key:
output (r, s)
Variations:
- Simple nested loop: O(n*m)
- Index nested loop: Inner table indexed → O(n * log m)
- Block nested loop: Process in blocks → better cache usage
Sort-Merge Join:
// Sort both tables on join key
sort(outer, key)
sort(inner, key)
// Merge
i = 0, j = 0
while i < outer.length and j < inner.length:
if outer[i].key < inner[j].key:
i++
else if outer[i].key > inner[j].key:
j++
else:
// Match found
output all combinations
i++, j++
Complexity: O(n log n + m log m + n + m)
Hash Join:
// Build phase
for each row s in inner table:
hash = hash(s.key)
add s to hash table[hash]
// Probe phase
for each row r in outer table:
hash = hash(r.key)
for each s in hash table[hash]:
if r.key = s.key:
output (r, s)
Complexity: O(n + m) average case
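The hash-join pseudocode above runs directly as Python, with toy rows as (key, payload) tuples:

```python
from collections import defaultdict

def hash_join(outer, inner):
    table = defaultdict(list)
    for s in inner:                  # build phase on the (smaller) inner input
        table[s[0]].append(s)
    result = []
    for r in outer:                  # probe phase: hash each outer key
        for s in table[r[0]]:
            result.append((r, s))
    return result

orders = [(1, "o1"), (2, "o2"), (1, "o3")]
customers = [(1, "Alice"), (3, "Carol")]
print(hash_join(orders, customers))
# [((1, 'o1'), (1, 'Alice')), ((1, 'o3'), (1, 'Alice'))]
```

Building the hash table on the smaller input keeps it memory-resident, which is where the O(n + m) average cost comes from.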
Join Algorithm Selection:
| Scenario | Recommended Algorithm |
|---|---|
| One table small | Nested Loop |
| Both tables large, sorted | Merge Join |
| Both tables large, unsorted | Hash Join |
| Index available | Index Nested Loop |
| Highly skewed data | Consider hash or merge |
Cost-based optimizer estimates execution costs and chooses cheapest plan.
Cost Components:
I/O Cost:
- Sequential reads (cheaper)
- Random reads (more expensive)
- Writing intermediate results
CPU Cost:
- Tuple processing
- Predicate evaluation
- Sorting/comparisons
- Hash computations
Memory Cost:
- Buffer pool usage
- Sort/hash memory
- Temporary storage
Network Cost:
- Data transfer between nodes
- Distributed query overhead
Cost Estimation:
-- Table statistics
SELECT
reltuples AS row_count,
relpages AS page_count
FROM pg_class
WHERE relname = 'employees';
-- Column statistics
SELECT
n_distinct AS distinct_values,
most_common_vals,
most_common_freqs
FROM pg_stats
WHERE tablename = 'employees' AND attname = 'department';
Example Cost Calculation:
Table employees: 100,000 rows, 1,000 pages
Index on department: height=3, 50 distinct values
Query: SELECT * FROM employees WHERE department = 'IT' (10% of rows)
Full Table Scan Cost:
- Read all pages: 1,000 I/Os
- CPU: evaluate predicate on 100,000 rows
- Total: ~1,000 I/O cost units
Index Scan Cost:
- Traverse index: 3 I/Os
- Read matching rows: 10,000 rows (maybe 100 pages if clustered)
- Total: 103 I/Os + random I/O penalty
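The comparison above can be written out as arithmetic. A minimal sketch using the stated figures; the 4x penalty for random versus sequential I/O is an assumption for illustration:

```python
pages, rows = 1_000, 100_000
selectivity, index_height = 0.10, 3
random_io_penalty = 4      # assumed: a random read costs ~4x a sequential one

full_scan_cost = pages                          # 1,000 sequential page reads
matching_pages = int(pages * selectivity)       # ~100 pages if clustered
index_scan_cost = (index_height + matching_pages) * random_io_penalty

print(full_scan_cost, index_scan_cost)  # 1000 412
```

With a clustered index the index scan wins here, but if the 10,000 matching rows were scattered across most of the 1,000 pages (unclustered), the random-I/O term would exceed the full scan cost, which is why optimizers often prefer sequential scans at low selectivity.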
Heuristic optimization applies rules without detailed cost analysis.
Common Heuristics:
- Push selections down: Apply filters early
- Project early: Reduce tuple size
- Reorder joins: Smaller intermediate results first
- Combine operations: Eliminate unnecessary steps
Heuristic Transformation Example:
Original:
π_name(σ_department='IT'(employees ⋈ departments))
Apply heuristics:
- Push selection: σ_department='IT' before join
- Project early: π_name after join
Result:
π_name((σ_department='IT'(employees)) ⋈ departments)
Join Order Heuristics:
Given joins: A ⋈ B ⋈ C ⋈ D
Heuristic rules:
- Smallest relations first
- Most selective filters first
- Consider available indexes
Statistics help optimizer estimate result sizes.
Table-Level Statistics:
- Number of rows (cardinality)
- Number of pages (size)
- Average row length
- Last analyzed timestamp
Column-Level Statistics:
- Number of distinct values
- Null fraction
- Most common values
- Data distribution
Histograms:
Equi-width histogram:
Value Range: 0-100
Bucket 1: 0-20 (500 rows)
Bucket 2: 21-40 (200 rows)
Bucket 3: 41-60 (800 rows)
Bucket 4: 61-80 (400 rows)
Bucket 5: 81-100 (600 rows)
Equi-height histogram:
Each bucket has ~500 rows
Bucket 1: 0-15
Bucket 2: 16-35
Bucket 3: 36-45
Bucket 4: 46-70
Bucket 5: 71-100
Most Common Values (MCV):
Value 'IT' : occurs 2000 times (10%)
Value 'Sales' : occurs 1800 times (9%)
Value 'Eng' : occurs 1500 times (7.5%)
Other values : 73.5% spread across 47 values
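Histograms turn predicates into row estimates. A sketch of range-selectivity estimation over the equi-width histogram above, using the standard assumption that values are uniformly spread within a bucket:

```python
buckets = {(0, 20): 500, (21, 40): 200, (41, 60): 800,
           (61, 80): 400, (81, 100): 600}   # (lo, hi) -> row count

def est_rows_between(lo, hi):
    """Estimate rows with lo <= value <= hi (uniform within each bucket)."""
    rows = 0.0
    for (b_lo, b_hi), count in buckets.items():
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo) + 1)
        rows += count * overlap / (b_hi - b_lo + 1)
    return rows

print(est_rows_between(41, 60))  # 800.0: the whole third bucket
print(est_rows_between(41, 50))  # 400.0: half of it, by the uniform assumption
```

For the most common values, the optimizer bypasses the histogram entirely and uses the stored MCV frequencies, which is far more accurate for skewed columns.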
Understanding execution plans is crucial for query tuning.
Reading Execution Plans:
-- PostgreSQL
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
-- Output:
Index Scan using idx_orders_customer on orders
(cost=0.29..8.31 rows=1 width=32)
(actual time=0.023..0.025 rows=1 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=4
Planning Time: 0.123 ms
Execution Time: 0.045 ms
Plan Components:
- Operation: What's being done (Index Scan, Hash Join)
- Cost: Estimated cost (startup..total)
- Rows: Estimated/actual row count
- Width: Average row size in bytes
- Actual time: Actual execution time
- Loops: How many times operation executed
- Buffers: Buffer cache usage
Common Plan Issues:
Sequential Scan on Large Table:
Seq Scan on orders (cost=0.00..8341.00 rows=100000 width=40)
Filter: (total > 1000)
Solution: Add index on total
Nested Loops on Large Tables:
Nested Loop (cost=1000.00..50000.00 rows=5000)
-> Seq Scan on large_table1
-> Index Scan on large_table2
Solution: Consider hash or merge join
Sort Without Index:
Sort (cost=5000.00..5125.00 rows=50000)
Sort Key: order_date
-> Seq Scan on orders
Solution: Index on order_date
Distributed queries add complexity of data location and network costs.
Challenges:
- Data fragmentation across nodes
- Network transmission costs
- Local processing costs vary
- Site autonomy constraints
Optimization Techniques:
Query Shipping: Move query to data
Execute query fragment at each site
Combine results centrally
Data Shipping: Move data to query
Transfer needed tables to central site
Execute query locally
Hybrid Approaches:
- Ship only relevant fragments
- Semijoin reduction
- Bloom filters
Distributed Query Plan Example:
Query: SELECT * FROM customers c JOIN orders o ON c.id = o.customer_id
Data: customers in NY, orders in London
Options:
1. Ship customers to London:
   - Transfer customers (100MB) to London
   - Execute join in London
   - Return results (10MB)
2. Ship orders to NY:
   - Transfer orders (1GB) to NY
   - Execute join in NY
   - Return results (10MB)
3. Semijoin reduction:
   - NY sends customer IDs to London
   - London filters orders by those IDs
   - London sends filtered orders (100MB) to NY
   - NY performs join
Cost estimation includes:
- Data transfer volume
- Network bandwidth
- Local processing costs
- Result size
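Comparing the three options above by network volume alone is a simple minimum. A sketch using the sizes from the example; the 1 MB for the shipped customer-ID list in the semijoin option is an assumption:

```python
# Bytes moved over the network, in MB, per strategy
strategies = {
    "ship_customers": 100 + 10,   # customers to London, results back
    "ship_orders":    1024 + 10,  # 1GB of orders to NY, results returned
    "semijoin":       1 + 100,    # assumed ~1MB of IDs out, filtered orders back
}

best = min(strategies, key=strategies.get)
print(best, strategies[best])  # semijoin 101
```

A full cost model would also weight these volumes by link bandwidth and add each site's local processing cost, but the semijoin's advantage here (shipping only the rows that can possibly join) is typical.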
A distributed database is a single logical database physically distributed across multiple computers.
Characteristics:
- Data stored at multiple sites
- Sites connected by network
- Each site has local autonomy
- Distributed Transaction Management
- Location transparency
Advantages:
- Reliability and availability
- Scalability
- Data locality
- Economics (cheaper than centralized mainframe)
Disadvantages:
- Complexity
- Cost of network communication
- Security challenges
- Integrity control harder
Transparency Levels:
Fragmentation Transparency: User unaware of data fragmentation
Location Transparency: User unaware of data location
Replication Transparency: User unaware of data replication
Failure Transparency: System handles failures transparently
Fragmentation divides relations into smaller pieces stored at different sites.
Horizontal Fragmentation:
Split relation by rows based on predicates.
Customers (customer_id, name, country, credit_limit)
Fragment 1 (Europe): country IN ('UK', 'FR', 'DE')
Fragment 2 (Americas): country IN ('US', 'CA', 'BR')
Fragment 3 (Asia): country IN ('JP', 'CN', 'IN')
Vertical Fragmentation:
Split relation by columns.
Employee (emp_id, name, salary, dept_id, health_info)
Fragment 1 (Public): emp_id, name, dept_id
Fragment 2 (HR only): emp_id, salary
Fragment 3 (Medical): emp_id, health_info
Mixed Fragmentation:
Combine horizontal and vertical.
First vertical: Employee_Public (emp_id, name, dept_id)
Then horizontal: Dept10_Employees, Dept20_Employees
Correctness Rules:
- Completeness: All data appears somewhere
- Reconstruction: Can reconstruct original relation
- Disjointness: Fragments don't overlap unnecessarily
Replication maintains copies of data at multiple sites.
Full Replication:
- Complete copy at every site
- Highest availability
- Update overhead
Partial Replication:
- Selected fragments replicated
- Balance availability and overhead
No Replication:
- Each fragment at one site
- Simplest, least overhead
- Single point of failure
Replication Timing:
Synchronous Replication:
- All copies updated in same transaction
- Strong consistency
- Higher latency
Asynchronous Replication:
- Updates propagate after commit
- Better performance
- Temporary inconsistency
Master-Slave Replication:
Master Site: Accepts all writes
↓ asynchronous
Slave 1: Read-only copy
Slave 2: Read-only copy
Slave 3: Read-only copy
Multi-Master Replication:
Site A: Accepts writes
↓ bidirectional sync
Site B: Accepts writes
↓ bidirectional sync
Site C: Accepts writes
Distributed transactions span multiple sites while maintaining ACID properties.
Two-Phase Commit (2PC) Protocol:
Phase 1 - Prepare:
Coordinator to all participants:
"Prepare to commit transaction T"
Each participant:
- Check if can commit
- Write prepare record to log
- Reply YES or NO
Phase 2 - Commit/Abort:
If all replied YES:
Coordinator writes commit record
Broadcast COMMIT to all participants
Else:
Coordinator writes abort record
Broadcast ABORT to all participants
Each participant:
- Complete transaction
- Write commit/abort record
- Acknowledge
2PC States:
Initial → Prepare Sent → Waiting →
if all YES → Commit Sent → Committed
if any NO → Abort Sent → Aborted
Problems with 2PC:
- Blocking: Coordinator failure blocks participants
- Performance: Multiple network round trips
- Scalability: Coordinator becomes bottleneck
Detailed 2PC Protocol:
Step 1: Application sends COMMIT to coordinator
Step 2: Coordinator sends PREPARE to all participants
Step 3: Each participant:
- Force-write prepare record to log
- Send VOTE_COMMIT or VOTE_ABORT
Step 4: Coordinator collects votes
- If all VOTE_COMMIT: write COMMIT record, send COMMIT
- If any VOTE_ABORT: write ABORT record, send ABORT
Step 5: Participants:
- Receive decision
- Force-write decision to log
- Complete transaction
- Send ACK
Step 6: Coordinator receives all ACKs
- Write end record (optional)
- Forget transaction
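The decision rule at the heart of the protocol is: commit only if every participant votes yes. A minimal sketch of the coordinator logic (class and method names are illustrative; a real participant would force-write log records at each step):

```python
def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]      # phase 1: PREPARE
    decision = "COMMIT" if all(votes) else "ABORT"   # unanimity required
    for p in participants:                           # phase 2: broadcast
        p.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit, self.state = can_commit, None
    def prepare(self):
        # real systems force-write a prepare record before voting YES
        return self.can_commit
    def finish(self, decision):
        self.state = decision   # force-write decision, then acknowledge

ok, bad = Participant(True), Participant(False)
print(two_phase_commit([ok, bad]))  # ABORT: one NO vote aborts everyone
print(ok.state)                     # ABORT
```

Note that a participant that has voted YES is blocked: it can neither commit nor abort until it hears the coordinator's decision, which is exactly the blocking weakness discussed below.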
Failure Handling:
Participant failure during prepare:
- Coordinator times out → abort
Coordinator failure during commit:
- Participants query each other
- Use consensus to determine outcome
Network partition:
- Timeout-based decisions
- May need human intervention
3PC adds an extra phase to reduce blocking problems.
Phase 1 - CanCommit: Similar to 2PC prepare, but no final decision yet.
Phase 2 - PreCommit: If all ready, coordinator sends pre-commit. Participants acknowledge and prepare to commit.
Phase 3 - DoCommit: Coordinator sends final commit.
Advantages:
- Non-blocking if coordinator fails during commit
- Participants can timeout and decide
Disadvantages:
- More messages
- Still vulnerable to network partitions
- Complex implementation
Scalability Challenges:
- Vertical scaling only (bigger servers)
- Difficult to distribute across many nodes
- Connection pooling limits
Schema Rigidity:
- Fixed schema requires migrations
- Semi-structured data difficult
- Evolving requirements hard
Performance:
- JOIN overhead at scale
- ACID overhead for simple operations
- Not optimized for specific access patterns
Data Model Mismatch:
- Object-relational impedance mismatch
- Complex mapping for nested data
- Aggregation-heavy queries slow
CAP theorem states distributed systems can only guarantee two of three properties:
Consistency (C): All nodes see same data simultaneously
Availability (A): Every request receives response (success/failure)
Partition Tolerance (P): System continues despite network partitions
CAP Combinations:
| System Type | Properties | Examples |
|---|---|---|
| CP | Consistency + Partition | HBase, MongoDB (default) |
| AP | Availability + Partition | Cassandra, CouchDB |
| CA | Consistency + Availability | Traditional RDBMS (single node) |
PACELC Extension:
- If Partition (P): trade C vs A
- Else (E): trade Latency (L) vs Consistency (C)
Characteristics:
- Simple data model: key → value
- Values are opaque (blobs)
- High performance
- Horizontal scaling
Operations:
- GET(key) → value
- PUT(key, value)
- DELETE(key)
- Batch operations
Examples:
Redis:
SET user:1001 "{name:'Alice',email:'alice@example.com'}"
GET user:1001
HSET user:1001 name "Alice"
HGET user:1001 name
Amazon DynamoDB:
# Python with boto3
table.put_item(
Item={
'user_id': '1001',
'name': 'Alice',
'email': 'alice@example.com'
}
)
response = table.get_item(Key={'user_id': '1001'})
Use Cases:
- Session storage
- User profiles
- Shopping carts
- Real-time leaderboards
- Caching layer
Characteristics:
- Store semi-structured documents (JSON, BSON, XML)
- Schema-flexible
- Nested data structures
- Document-level operations
MongoDB Example:
// Insert document
db.users.insertOne({
name: "Alice Johnson",
email: "alice@example.com",
address: {
street: "123 Main St",
city: "Boston",
zip: "02101"
},
orders: [
{ product: "laptop", price: 1200 },
{ product: "mouse", price: 25 }
]
})
// Query
db.users.find({ "address.city": "Boston" })
// Update
db.users.updateOne(
{ email: "alice@example.com" },
{ $push: { orders: { product: "keyboard", price: 75 } } }
)
// Index
db.users.createIndex({ "address.city": 1 })
Query Features:
- Field selection
- Range queries
- Regular expressions
- Geospatial queries
- Aggregation pipeline
Aggregation Pipeline:
db.orders.aggregate([
{ $match: { status: "completed" } },
{ $group: { _id: "$customer_id", total: { $sum: "$amount" } } },
{ $sort: { total: -1 } },
{ $limit: 10 }
])
Characteristics:
- Store data in column families
- Each row can have different columns
- Optimized for wide tables
- Good for time-series data
Cassandra Data Model:
-- Keyspace (database)
CREATE KEYSPACE store
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
-- Column family (table)
CREATE TABLE users (
user_id UUID PRIMARY KEY,
name TEXT,
email TEXT,
created_at TIMESTAMP
);
-- Wide row design for time series
CREATE TABLE sensor_data (
sensor_id TEXT,
timestamp TIMESTAMP,
temperature FLOAT,
humidity FLOAT,
pressure FLOAT,
PRIMARY KEY (sensor_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
-- Queries
INSERT INTO users (user_id, name, email)
VALUES (uuid(), 'Alice', 'alice@example.com');
SELECT * FROM sensor_data
WHERE sensor_id = 'sensor_01'
AND timestamp >= '2024-01-01';
Cassandra Features:
- Tunable consistency
- Linear scalability
- No single point of failure
- Optimized for writes
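Tunable consistency can be illustrated with a toy quorum model. This is a sketch under simplifying assumptions (replicas as an in-memory list, versions as integers, no failures); it shows why choosing read and write quorums with R + W > N guarantees a read always sees the latest write.

```python
import random

def write(replicas, w, version, value):
    """Write to w of the n replicas; the others keep their stale copy."""
    for i in random.sample(range(len(replicas)), w):
        replicas[i] = (version, value)

def read(replicas, r):
    """Read r replicas and return the freshest copy seen."""
    return max(random.sample(replicas, r))   # tuples compare by version first
```

With N=3, W=2, R=2 every read quorum overlaps the write quorum, so at least one sampled replica holds the latest version; with R=1 a stale read is possible.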
Characteristics:
- Nodes (entities) and edges (relationships)
- Property graphs (attributes on nodes/edges)
- Optimized for relationship traversal
- Pattern matching queries
Neo4j Example:
// Create nodes
CREATE (alice:Person {name: 'Alice', age: 30})
CREATE (bob:Person {name: 'Bob', age: 32})
CREATE (product:Product {name: 'Laptop', price: 1200})
// Create relationships
CREATE (alice)-[:KNOWS]->(bob)
CREATE (alice)-[:BOUGHT {date: '2024-01-15'}]->(product)
// Query friends of friends
MATCH (alice:Person {name: 'Alice'})-[:KNOWS*2]->(friend)
RETURN DISTINCT friend.name
// Find purchase recommendations
MATCH (user:Person {name: 'Alice'})-[:BOUGHT]->(product)<-[:BOUGHT]-(other)-[:BOUGHT]->(rec)
WHERE NOT (user)-[:BOUGHT]->(rec)
RETURN rec.name, COUNT(*) AS frequency
ORDER BY frequency DESC
LIMIT 5
// Shortest path
MATCH p = shortestPath(
(alice:Person {name: 'Alice'})-[:KNOWS*]-(charlie:Person {name: 'Charlie'})
)
RETURN p
Use Cases:
- Social networks
- Recommendation engines
- Fraud detection
- Network/IT operations
- Knowledge graphs
| Aspect | SQL | NoSQL |
|---|---|---|
| Data Model | Tables, rows | Various (doc, graph, key-value) |
| Schema | Fixed, rigid | Flexible, dynamic |
| ACID | Full ACID | Varies (eventual consistency common) |
| Joins | Supported | Limited or none |
| Scaling | Vertical | Horizontal |
| Query Language | SQL | API or specific language |
| Transactions | Multi-row ACID | Often single-document only |
| Use Cases | Complex queries, strict consistency | High volume, flexible schema |
NewSQL combines SQL familiarity with NoSQL scalability.
Characteristics:
- SQL interface
- ACID transactions across nodes
- Horizontal scaling
- Distributed query execution
Key Features:
- Automatic sharding
- Distributed query optimization
- Consensus-based replication
- Global consistency
Sharding Strategies:
Range-Based Sharding:
Shard 1: IDs 1-10000
Shard 2: IDs 10001-20000
Shard 3: IDs 20001-30000
Hash-Based Sharding:
shard = hash(key) % N
Directory-Based Sharding: Maintain lookup table for key→shard mapping.
Consistent Hashing: Minimal reshuffling when nodes added/removed.
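The last two strategies can be sketched in a few lines of Python. This is an illustrative toy (the `ConsistentHashRing` class, the vnode count, and the MD5 placement hash are choices made here, not a standard API); it demonstrates the key property that removing a node only remaps the keys that lived on it.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so key placement survives process restarts
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative)."""
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []              # sorted list of (hash, node)
        for n in nodes:
            self.add_node(n)

    def add_node(self, node):
        # Each physical node owns many points on the ring for balance
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get_node(self, key):
        # Walk clockwise to the first vnode at or after the key's hash
        h = _hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))
        if idx == len(self._ring):
            idx = 0                  # wrap around the ring
        return self._ring[idx][1]
```

Contrast with `hash(key) % N`: there, changing N remaps almost every key; here, only the departed node's keys move.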
CockroachDB:
-- CockroachDB extends PostgreSQL syntax
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
name STRING,
email STRING UNIQUE,
created_at TIMESTAMP DEFAULT now()
);
-- Automatic distribution
SHOW RANGES FROM TABLE users;
-- Distributed transactions
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 'abc';
UPDATE accounts SET balance = balance + 100 WHERE id = 'xyz';
COMMIT; -- Atomic across nodes
-- Geo-partitioning for data locality
ALTER TABLE european_customers
CONFIGURE ZONE USING constraints = '[+region=europe]';
Google Spanner:
Features:
- Global distribution
- TrueTime API for external consistency
- SQL interface
- Automatic replication
- Strong consistency
-- Spanner SQL
CREATE TABLE Users (
UserId INT64 NOT NULL,
Name STRING(MAX),
Email STRING(MAX)
) PRIMARY KEY (UserId);
-- Interleave tables for locality
CREATE TABLE Posts (
UserId INT64 NOT NULL,
PostId INT64 NOT NULL,
Content STRING(MAX)
) PRIMARY KEY (UserId, PostId),
INTERLEAVE IN PARENT Users ON DELETE CASCADE;
YugabyteDB:
- PostgreSQL-compatible
- Cassandra-inspired storage
- Distributed ACID transactions
- Multi-region deployment
OLTP (Online Transaction Processing):
- Many small transactions
- Current data
- Normalized schema
- Row-oriented storage
- Fast queries/writes
OLAP (Online Analytical Processing):
- Few complex queries
- Historical data
- Denormalized schema
- Column-oriented storage
- Aggregations, summaries
Comparison:
| Aspect | OLTP | OLAP |
|---|---|---|
| Users | Clerks, customers | Analysts, executives |
| Operations | Read/update | Read mostly |
| Query type | Simple, repetitive | Complex, ad-hoc |
| Data volume | GB | TB to PB |
| Response time | Milliseconds | Seconds to minutes |
| Design | Application-oriented | Subject-oriented |
Star schema is a denormalized data warehouse design.
Components:
Fact Table:
- Measures (numeric, additive)
- Foreign keys to dimensions
- Large (millions of rows)
Dimension Tables:
- Descriptive attributes
- Denormalized
- Smaller than fact
Star Schema Example:
Fact_Sales
├── date_key (FK to Dim_Date)
├── product_key (FK to Dim_Product)
├── customer_key (FK to Dim_Customer)
├── store_key (FK to Dim_Store)
├── quantity_sold
├── unit_price
└── total_amount
Dim_Date: date_key, date, year, quarter, month, day
Dim_Product: product_key, product_name, category, brand, supplier
Dim_Customer: customer_key, name, address, city, state, segment
Dim_Store: store_key, store_name, region, manager
Query Example:
SELECT
d.year,
p.category,
SUM(f.total_amount) AS sales
FROM Fact_Sales f
JOIN Dim_Date d ON f.date_key = d.date_key
JOIN Dim_Product p ON f.product_key = p.product_key
WHERE d.year = 2024
GROUP BY d.year, p.category
ORDER BY sales DESC;
Snowflake schema normalizes dimension tables.
Snowflake Example:
Dim_Product
├── product_key
├── product_name
├── category_key (FK to Dim_Category)
└── supplier_key (FK to Dim_Supplier)
Dim_Category: category_key, category_name, department
Dim_Supplier: supplier_key, supplier_name, city, country
Advantages:
- Less redundancy
- Easier maintenance
- Space savings
Disadvantages:
- More joins
- Slower queries
- More complex
ETL (Extract, Transform, Load) moves data from sources to warehouse.
Extract:
- Read from source systems
- Handle different formats
- Incremental or full load
- Change data capture
Transform:
- Clean data (nulls, defaults)
- Validate business rules
- Aggregate/calculate
- Conform dimensions
- Handle slowly changing dimensions
Load:
- Insert into warehouse
- Update existing records
- Build indexes
- Update aggregates
- Verify counts
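The slowly-changing-dimension step mentioned above is worth a sketch for the common Type 2 pattern: close the current row and insert a new version when a tracked attribute changes. The `apply_scd_type2` helper and column names below are hypothetical, in-memory illustrations.

```python
from datetime import date

def apply_scd_type2(dim_rows, incoming, key, tracked):
    """Type 2 SCD: expire the current row and append a new version
    when a tracked attribute changes (in-memory illustration)."""
    today = date.today().isoformat()
    current = next((r for r in dim_rows
                    if r[key] == incoming[key] and r["is_current"]), None)
    if current is None:
        # First time we see this business key
        dim_rows.append({**incoming, "valid_from": today,
                         "valid_to": None, "is_current": True})
    elif any(current[c] != incoming[c] for c in tracked):
        # Tracked attribute changed: close the old row, add a new version
        current["valid_to"] = today
        current["is_current"] = False
        dim_rows.append({**incoming, "valid_from": today,
                         "valid_to": None, "is_current": True})
```

Unchanged incoming rows produce no new version, so the dimension keeps exactly one current row per business key plus its full history.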
ETL Example (Python with pandas):
import pandas as pd
from sqlalchemy import create_engine
# Extract
df = pd.read_csv('sales_data.csv')
# Transform
df['order_date'] = pd.to_datetime(df['order_date'])
df['total'] = df['quantity'] * df['price']
df = df.drop_duplicates()
df = df[df['total'] > 0]
# Load
engine = create_engine('postgresql://user:pass@localhost/warehouse')
df.to_sql('staging_sales', engine, if_exists='replace')
# Run SQL transformations
with engine.connect() as conn:
conn.execute("""
INSERT INTO fact_sales
SELECT s.*, d.date_key, p.product_key
FROM staging_sales s
JOIN dim_date d ON s.order_date = d.date
JOIN dim_product p ON s.product_id = p.product_id
""")
Data marts are subject-specific data warehouse subsets.
Types:
Dependent Data Mart:
- Derived from enterprise warehouse
- Consistent data
- Single source of truth
Independent Data Mart:
- Built directly from sources
- Faster to build
- Potential inconsistency
Data Mart Examples:
Sales Mart:
- Fact_Sales
- Dim_Product
- Dim_Customer
- Dim_Date
- Dim_Store
Inventory Mart:
- Fact_Inventory
- Dim_Product
- Dim_Warehouse
- Dim_Date
OLAP operations enable interactive data analysis.
Roll-up (Drill-up): Increase aggregation level.
Daily → Monthly → Quarterly → Yearly
City → State → Region → Country
Product → Category → Department
Drill-down: Decrease aggregation level (more detail).
Yearly → Quarterly → Monthly → Daily
Slice: Select one dimension value.
Sales for 2024 only
Dice: Select multiple dimension values.
Sales for 2024, Product Category 'Electronics', Region 'West'
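The operations above can be mimicked on a toy in-memory fact table. A pure-Python sketch (the row layout and the `roll_up` helper are illustrative, not a standard API):

```python
from collections import defaultdict

# Toy fact rows: (year, quarter, region, amount)
sales = [
    (2024, "Q1", "West", 100),
    (2024, "Q1", "East", 200),
    (2024, "Q2", "West", 150),
    (2023, "Q4", "West", 80),
]

def roll_up(rows, dim_index):
    """Aggregate the amount measure up to a single dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim_index]] += row[3]
    return dict(totals)

by_year = roll_up(sales, 0)                      # roll-up: quarters -> years
slice_2024 = [r for r in sales if r[0] == 2024]  # slice: fix one dimension
dice = [r for r in sales if r[0] == 2024 and r[2] == "West"]  # dice: fix several
```

Drill-down is the inverse of `roll_up`: group by the finer key `(year, quarter)` instead of `year` alone.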
Pivot (Cross-tab): Reorient view (rows to columns).
-- Pivot example
SELECT *
FROM (
SELECT year, quarter, total
FROM sales
) src
PIVOT (
SUM(total)
FOR quarter IN ('Q1' AS Q1, 'Q2' AS Q2, 'Q3' AS Q3, 'Q4' AS Q4)
) pvt;
Drill-across: Query multiple fact tables.
SELECT d.year,
s.total_sales,
i.total_inventory
FROM dim_date d
LEFT JOIN sales_fact s ON d.date_key = s.date_key
LEFT JOIN inventory_fact i ON d.date_key = i.date_key;
The 5 V's:
Volume: Massive data scale (terabytes to petabytes)
Velocity: High-speed data generation (real-time streams)
Variety: Diverse data types (structured, semi-structured, unstructured)
Veracity: Data quality and trustworthiness
Value: Business value from analysis
Hadoop Components:
HDFS (Hadoop Distributed File System):
- Master/NameNode: Manages metadata
- DataNodes: Store actual data
- Block replication (default 3x)
- Rack awareness
YARN (Yet Another Resource Negotiator):
- ResourceManager: Cluster resource management
- NodeManager: Per-node resource management
- ApplicationMaster: Per-application coordination
HDFS Write Pipeline:
Client → NameNode (get block locations)
Client → DataNode1 (write block)
DataNode1 → DataNode2 (replicate)
DataNode2 → DataNode3 (replicate)
MapReduce processes data in parallel across cluster.
Map Phase:
- Input split into chunks
- Map function processes each record
- Output intermediate key-value pairs
Shuffle Phase:
- Sort and group by key
- Transfer data to reducers
Reduce Phase:
- Process grouped values
- Output final results
Word Count Example:
// Mapper
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
// Reducer
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
Spark provides in-memory cluster computing.
RDD (Resilient Distributed Dataset):
- Immutable distributed collection
- Partitioned across cluster
- Can be rebuilt if lost
- Two operation types: transformations (lazy, build a new RDD) and actions (trigger execution)
Spark SQL:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("Example") \
.getOrCreate()
# Read data
df = spark.read.json("sales.json")
# Transformations
result = df.filter(df.amount > 100) \
.groupBy("product") \
.agg({"amount": "sum"}) \
.orderBy("sum(amount)", ascending=False)
# Actions
result.show()
result.write.parquet("output/")
Spark Streaming:
streaming_df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "sales-topic") \
.load()
# Process in micro-batches
counts = streaming_df \
.groupBy(window("timestamp", "1 hour"), "product") \
.count()
query = counts \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
Data lakes store raw data in native format.
Characteristics:
- Store all data (structured/unstructured)
- Schema-on-read
- Centralized repository
- Supports various processing engines
Data Lake Architecture:
Data Sources → Ingestion Layer → Storage Layer → Processing → Consumption
(IoT, logs, DBs) (Kafka, Flume) (HDFS/S3) (Spark, Hive) (BI, ML)
Data Lake vs Data Warehouse:
| Aspect | Data Lake | Data Warehouse |
|---|---|---|
| Data | Raw, any format | Processed, structured |
| Schema | On read | On write |
| Users | Data scientists | Business analysts |
| Purpose | Exploration | Reporting |
| Storage | Cheap object storage | Expensive DBMS |
| Quality | Raw, may be messy | Clean, trusted |
External Threats:
- SQL injection
- Network eavesdropping
- Unauthorized access
- Denial of service
Internal Threats:
- Privilege misuse
- Data theft by employees
- Accidental data exposure
- Weak passwords
Environmental Threats:
- Natural disasters
- Power failures
- Hardware failures
- Backup compromise
Discretionary Access Control (DAC):
- Object owner controls access
- GRANT/REVOKE privileges
- SQL standard
Mandatory Access Control (MAC):
- System-wide policies
- Labels and classifications
- Military/Government use
Role-Based Access Control (RBAC):
- Permissions assigned to roles
- Users assigned to roles
- Most common in enterprises
RBAC Example:
-- Create roles
CREATE ROLE sales_read;
CREATE ROLE sales_write;
CREATE ROLE manager;
-- Grant permissions to roles
GRANT SELECT ON customers TO sales_read;
GRANT SELECT, INSERT, UPDATE ON orders TO sales_write;
GRANT ALL PRIVILEGES ON sales.* TO manager;
-- Assign users to roles
GRANT sales_read TO user_alice;
GRANT sales_write TO user_bob;
GRANT manager TO user_carol;
SQL injection inserts malicious SQL through application input.
Vulnerable Code:
# Dangerous - don't do this
user_input = request.GET['username']
query = f"SELECT * FROM users WHERE username = '{user_input}'"
cursor.execute(query)
# Attacker input: ' OR '1'='1
# Result: SELECT * FROM users WHERE username = '' OR '1'='1'
# Returns all users!
Prevention:
# Safe - parameterized query
query = "SELECT * FROM users WHERE username = %s"
cursor.execute(query, (user_input,))
Additional Defenses:
- Input validation
- Stored procedures
- Least privilege accounts
- Web application firewalls
Transparent Data Encryption protects data at rest.
TDE Architecture:
Master Key → protects → Database Encryption Key → protects → Data
TDE Implementation (SQL Server):
-- Create master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongPassword123!';
-- Create certificate
CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE Certificate';
-- Create database encryption key
USE database;
CREATE DATABASE ENCRYPTION KEY
WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE TDECert;
-- Enable encryption
ALTER DATABASE database SET ENCRYPTION ON;
Column-Level Encryption:
-- PostgreSQL with pgcrypto
CREATE EXTENSION pgcrypto;
INSERT INTO users (username, ssn)
VALUES ('alice', pgp_sym_encrypt('123-45-6789', 'encryption-key'));
SELECT username,
pgp_sym_decrypt(ssn::bytea, 'encryption-key') AS ssn
FROM users;
Auditing tracks database activities for security and compliance.
Audit Configuration:
-- SQL Server Audit
CREATE SERVER AUDIT SecurityAudit
TO FILE (FILEPATH = 'C:\Audit\');
CREATE DATABASE AUDIT SPECIFICATION UserActivity
FOR SERVER AUDIT SecurityAudit
ADD (SELECT, INSERT, UPDATE, DELETE ON SCHEMA::dbo BY public);
ALTER DATABASE AUDIT SPECIFICATION UserActivity WITH (STATE = ON);
-- Oracle Fine-Grained Auditing
BEGIN
DBMS_FGA.ADD_POLICY(
object_schema => 'HR',
object_name => 'EMPLOYEES',
policy_name => 'audit_salary',
audit_condition => 'salary > 100000',
audit_column => 'SALARY',
handler_schema => 'SECURITY',
handler_module => 'log_salary_access',
enable => TRUE
);
END;
Audit Log Analysis:
-- Query audit logs
SELECT
event_time,
session_id,
database_principal_name,
statement
FROM sys.fn_get_audit_file('C:\Audit\*.sqlaudit', DEFAULT, DEFAULT)
WHERE event_time > '2024-01-01'
ORDER BY event_time DESC;
GDPR (General Data Protection Regulation) protects EU citizens' data.
Key Principles:
- Lawful processing
- Purpose limitation
- Data minimization
- Accuracy
- Storage limitation
- Integrity and confidentiality
- Accountability
Data Subject Rights:
- Right to access
- Right to rectification
- Right to erasure (right to be forgotten)
- Right to restrict processing
- Right to data portability
- Right to object
Technical Implementation:
-- Right to be forgotten
CREATE PROCEDURE forget_user(IN user_id_param INT)
BEGIN
START TRANSACTION;
-- Anonymize personal data
UPDATE users
SET email = CONCAT('deleted-', user_id, '@anonymous.com'),
name = 'Deleted User',
phone = NULL
WHERE user_id = user_id_param;
-- Delete sensitive data from other tables
DELETE FROM user_activity_log WHERE user_id = user_id_param;
DELETE FROM marketing_preferences WHERE user_id = user_id_param;
-- Log the deletion for audit
INSERT INTO deletion_log (user_id, deletion_date, reason)
VALUES (user_id_param, NOW(), 'GDPR request');
COMMIT;
END;
HIPAA (Health Insurance Portability and Accountability Act) protects health information.
Protected Health Information (PHI):
- Names
- Dates (except year)
- Phone numbers
- Email addresses
- SSN
- Medical record numbers
- Health plan numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers
- Device identifiers
- URLs/IP addresses
- Biometric identifiers
- Photos
- Any unique identifying number/code
HIPAA Technical Safeguards:
-- Access control
CREATE ROLE healthcare_provider;
CREATE ROLE billing_staff;
GRANT SELECT ON patients_demographics TO healthcare_provider;
GRANT SELECT ON patients_insurance TO billing_staff;
-- Audit logs
CREATE TABLE access_audit (
access_id INT PRIMARY KEY AUTO_INCREMENT,
user_id VARCHAR(100) NOT NULL,
patient_id INT NOT NULL,
access_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
action VARCHAR(50) NOT NULL,
reason_for_access VARCHAR(500),
ip_address VARCHAR(45)
);
-- Trigger for audit (note: SQL triggers fire only on INSERT/UPDATE/DELETE;
-- SELECT access must be logged by the application or a server audit feature)
CREATE TRIGGER audit_patient_update
AFTER UPDATE ON patients_demographics
FOR EACH ROW
BEGIN
INSERT INTO access_audit (user_id, patient_id, action, ip_address)
VALUES (SESSION_USER(), NEW.patient_id, 'UPDATE', SUBSTRING_INDEX(USER(), '@', -1));
END;
Data masking hides sensitive data from unauthorized users.
Static Data Masking: Permanent replacement of sensitive data in non-production environments.
Dynamic Data Masking: Real-time masking based on user permissions.
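The masking idea itself is simple to sketch in plain Python. The `mask_partial` and `mask_email` helpers below are hypothetical and only loosely mirror typical DBMS masking functions:

```python
def mask_partial(value, prefix=2, suffix=0, pad="X"):
    """Keep `prefix` leading and `suffix` trailing characters, mask the rest."""
    if len(value) <= prefix + suffix:
        return pad * len(value)          # too short: mask everything
    tail = value[len(value) - suffix:] if suffix else ""
    return value[:prefix] + pad * (len(value) - prefix - suffix) + tail

def mask_email(value):
    """aXXX@XXXX.com-style mask: keep the first character and the TLD."""
    local, _, domain = value.partition("@")
    return local[:1] + "XXX@XXXX." + domain.rsplit(".", 1)[-1]
```

In dynamic masking the DBMS applies functions like these at query time based on the caller's role, so the stored value is never altered.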
Dynamic Masking Examples:
-- SQL Server Dynamic Data Masking
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100) MASKED WITH (FUNCTION = 'partial(2, "XXXX", 0)'),
email VARCHAR(100) MASKED WITH (FUNCTION = 'email()'),
phone VARCHAR(20) MASKED WITH (FUNCTION = 'partial(0, "XXX-XXX-", 4)'),
ssn CHAR(11) MASKED WITH (FUNCTION = 'default()')
);
-- Oracle Data Redaction
BEGIN
DBMS_REDACT.ADD_POLICY(
object_schema => 'HR',
object_name => 'EMPLOYEES',
policy_name => 'redact_ssn',
column_name => 'SSN',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => 'VVVFVVFVVVV,VVV-XX-XXXX,*,1,5',
expression => '1=1'
);
END;
Security by Design Principles:
- Least Privilege: Grant minimum necessary permissions
- Defense in Depth: Multiple security layers
- Fail Secure: Default to denied
- Separation of Duties: No single user has all power
- Secure Defaults: Start secure, open only when needed
Secure Schema Design:
-- Separate sensitive data
CREATE TABLE public_user_info (
user_id INT PRIMARY KEY,
username VARCHAR(100) NOT NULL,
display_name VARCHAR(200)
);
CREATE TABLE private_user_data (
user_id INT PRIMARY KEY REFERENCES public_user_info(user_id),
email_hash CHAR(64) NOT NULL, -- Hash for lookups
email_encrypted TEXT NOT NULL, -- Encrypted for display
ssn_encrypted TEXT,
-- Additional sensitive data
);
-- Views for controlled access
CREATE VIEW user_contact AS
SELECT
u.user_id,
u.display_name,
CASE
WHEN CURRENT_ROLE() IN ('admin', 'support')
THEN pgp_sym_decrypt(p.email_encrypted, 'key')
ELSE '***REDACTED***'
END AS email
FROM public_user_info u
LEFT JOIN private_user_data p ON u.user_id = p.user_id;
Security Checklist:
- Use strong authentication (MFA where possible)
- Encrypt sensitive data at rest
- Use TLS for data in transit
- Regular security patches
- Audit logging enabled
- Least privilege access
- Secure backup strategy
- Regular security testing
- Incident response plan
- Compliance monitoring
Redis is an in-memory data structure store.
Data Structures:
# Strings
SET user:1001:name "Alice"
GET user:1001:name
# Lists (queues/stacks)
LPUSH tasks "job1"
RPUSH tasks "job2"
LPOP tasks
# Sets
SADD friends:1001 1002 1003 1004
SMEMBERS friends:1001
SISMEMBER friends:1001 1002
# Hashes
HSET user:1001 name "Alice" age 30 email "alice@example.com"
HGETALL user:1001
# Sorted Sets
ZADD leaderboard 100 "player1"
ZADD leaderboard 85 "player2"
ZRANGE leaderboard 0 10 WITHSCORES
# Geospatial
GEOADD locations 13.361389 38.115556 "Palermo"
GEODIST locations "Palermo" "Catania" km
Persistence Options:
- RDB (snapshots): Point-in-time dumps
- AOF (Append Only File): Log every write
- No persistence: Pure cache
Use Cases:
- Session store
- Real-time leaderboards
- Message queues
- Rate limiting
- Caching layer
SAP HANA combines row and column stores in memory.
Features:
- Columnar storage for analytics
- Row store for OLTP
- Advanced compression
- Multi-core processing
- SQL and Graph support
Column Store Example:
-- Create column table
CREATE COLUMN TABLE sales (
id INTEGER PRIMARY KEY,
product_id INTEGER,
customer_id INTEGER,
amount DECIMAL(10,2),
sale_date DATE
);
-- In-memory processing
SELECT
product_id,
SUM(amount) AS total_sales
FROM sales
WHERE sale_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY product_id;
Characteristics:
- Decentralized
- Immutable
- Append-only
- Cryptographically verified
- Consensus-based
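The immutable, append-only property comes from hash chaining: each block stores the hash of its predecessor, so tampering with any block invalidates everything after it. A minimal sketch (block layout and helper names are illustrative):

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    # Hash over a canonical JSON encoding of the block contents
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain: list, data) -> None:
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"index": len(chain), "data": data, "prev_hash": prev}
    block["hash"] = block_hash(block)
    chain.append(block)

def verify_chain(chain: list) -> bool:
    # Tampering with any block breaks its own hash and every link after it
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        if block["hash"] != block_hash(body):
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True
```

Real blockchains add signatures, Merkle trees, and consensus on top, but the tamper-evidence comes from exactly this chaining.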
Consensus Mechanisms:
Proof of Work (PoW):
- Solve computational puzzle
- Energy intensive
- Used by Bitcoin
Proof of Stake (PoS):
- Validators stake cryptocurrency
- Energy efficient
- Used by Ethereum 2.0
Practical Byzantine Fault Tolerance (PBFT):
- Voting-based consensus
- Permissioned blockchains
- High performance
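Proof of Work can be sketched as a brute-force nonce search: find a nonce that makes the block hash start with a required number of zeros. Illustrative only — real systems hash a structured block header and adjust difficulty dynamically.

```python
import hashlib

def mine(block_data: str, difficulty: int = 4) -> int:
    """Find a nonce so the block's hash starts with `difficulty` zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1        # no shortcut exists: the puzzle must be brute-forced
```

Each extra zero digit multiplies the expected work by 16, which is why verifying a block is cheap but producing one is expensive.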
Self-executing contracts with terms directly written in code.
Ethereum Smart Contract (Solidity):
pragma solidity ^0.8.0;
contract SimpleBank {
mapping(address => uint) private balances;
event Deposit(address indexed account, uint amount);
event Withdrawal(address indexed account, uint amount);
function deposit() public payable {
balances[msg.sender] += msg.value;
emit Deposit(msg.sender, msg.value);
}
function withdraw(uint amount) public {
require(balances[msg.sender] >= amount, "Insufficient balance");
balances[msg.sender] -= amount;
payable(msg.sender).transfer(amount);
emit Withdrawal(msg.sender, amount);
}
function getBalance() public view returns (uint) {
return balances[msg.sender];
}
}
| Aspect | Traditional DB | Blockchain |
|---|---|---|
| Control | Centralized | Decentralized |
| Write | Update/Delete | Append-only |
| Trust | Trust administrator | Trust math/cryptography |
| Performance | High (thousands TPS) | Low (tens TPS) |
| Cost | Operational | Transaction fees |
| Use Case | General purpose | Trustless transactions |
Vector databases store and query embeddings for AI applications.
Characteristics:
- Store high-dimensional vectors
- Efficient similarity search
- Indexing for nearest neighbors
- Integration with ML models
Popular Vector Databases:
- Pinecone
- Weaviate
- Milvus
- Qdrant
Example (Pinecone):
import pinecone
# Initialize
pinecone.init(api_key="your-api-key")
index = pinecone.Index("embeddings")
# Insert vectors
vectors = [
("id1", [0.1, 0.2, 0.3, ...], {"metadata": "text1"}),
("id2", [0.4, 0.5, 0.6, ...], {"metadata": "text2"})
]
index.upsert(vectors)
# Query similar vectors
results = index.query(
vector=[0.15, 0.25, 0.35, ...],
top_k=5,
include_metadata=True
)
Store embeddings alongside traditional data.
PostgreSQL with pgvector:
-- Enable extension
CREATE EXTENSION vector;
-- Create table with embeddings
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
content TEXT,
embedding vector(1536) -- OpenAI embedding dimension
);
-- Create index for similarity search
CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Insert with embedding
INSERT INTO documents (content, embedding)
VALUES ('Sample text', '[0.1,0.2,0.3,...]');
-- Similarity search
SELECT content, 1 - (embedding <=> '[0.15,0.25,0.35,...]') AS similarity
FROM documents
ORDER BY embedding <=> '[0.15,0.25,0.35,...]'
LIMIT 10;
Distance Metrics:
- Cosine similarity: Angle between vectors
- Euclidean distance: Straight-line distance
- Dot product: Projection similarity
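These three metrics are a few lines each in plain Python (helper names are mine; production systems use vectorized libraries):

```python
import math

def dot(a, b):
    # Projection similarity: large when vectors point the same way and are long
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance between the two points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Angle-based: 1.0 same direction, 0.0 orthogonal, -1.0 opposite
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

Note that pgvector's `<=>` operator returns cosine *distance*, i.e. 1 minus the cosine similarity computed here.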
Approximate Nearest Neighbor (ANN) Algorithms:
- HNSW (Hierarchical Navigable Small World): Graph-based
- IVF (Inverted File Index): Cluster-based
- PQ (Product Quantization): Compression-based
Machine learning improves query optimization.
Learned Indexes: Replace traditional indexes with neural networks that learn data distribution.
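The idea can be sketched with a linear model instead of a neural network: fit a line mapping keys to positions in a sorted array, then correct the prediction with a bounded local search. Everything here (function names, the error bound) is illustrative, not a real learned-index implementation.

```python
import bisect

def fit_line(keys):
    """Least-squares line predicting position from key (keys must be sorted)."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_x = (n - 1) / 2
    cov = sum((k - mean_k) * (x - mean_x) for x, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var if var else 0.0
    return slope, mean_x - slope * mean_k

def lookup(keys, key, model, err=16):
    """Predict the position, then binary-search a small window around it."""
    slope, intercept = model
    guess = int(slope * key + intercept)
    lo, hi = max(0, guess - err), min(len(keys), guess + err)
    i = bisect.bisect_left(keys, key, lo, hi)
    if i < len(keys) and keys[i] == key:
        return i
    # Model was off by more than `err`: fall back to a full binary search
    i = bisect.bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else -1
```

When the key distribution is close to the model, lookups touch only a constant-size window instead of a full B-tree path.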
Query Prediction: Predict query patterns to pre-optimize.
Cardinality Estimation: ML models estimate result sizes more accurately.
Example - Learned Cardinality:
-- Traditional: based on statistics
Estimated: 10,000 rows
Actual: 15,000 rows
-- ML model: learns from query history
Estimated: 14,800 rows
Actual: 15,000 rows -- Much closer!
Identifying Slow Queries:
-- PostgreSQL slow query log
log_min_duration_statement = 1000 -- Log queries > 1 second
-- MySQL slow query log
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
-- Find current slow queries
SELECT * FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '1 minute';
-- Analyze query performance
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM large_table WHERE condition;
Index Maintenance:
-- Find unused indexes
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
-- Find duplicate indexes
SELECT pg_size_pretty(SUM(pg_relation_size(idx))::BIGINT) AS total,
array_agg(idx) AS indexes
FROM (
SELECT indexrelid::regclass AS idx,
indrelid,
array_agg(attname) AS cols
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indrelid
AND a.attnum = ANY(i.indkey)
WHERE i.indisunique = false
GROUP BY i.indexrelid, i.indrelid, i.indkey
) t
GROUP BY indrelid, cols
HAVING COUNT(*) > 1;
-- Rebuild fragmented indexes
REINDEX INDEX idx_name;
Database Caching:
-- PostgreSQL shared buffers
SHOW shared_buffers; -- Typically 25% of RAM
-- MySQL InnoDB buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Query cache (deprecated in modern versions)
-- Use application-level caching instead
Application-Level Caching (Redis):
import redis
import json
cache = redis.Redis(host='localhost', port=6379, db=0)
def get_user(user_id):
# Try cache first
cached = cache.get(f"user:{user_id}")
if cached:
return json.loads(cached)
# Cache miss - query database
user = db.query("SELECT * FROM users WHERE id = %s", user_id)
# Store in cache (expires in 1 hour)
cache.setex(f"user:{user_id}", 3600, json.dumps(user))
return user
Vertical Scaling (Scale Up):
- More CPU, RAM, faster storage
- Simpler, no application changes
- Hardware limits
Horizontal Scaling (Scale Out):
Read Replicas:
-- Master handles writes
-- Replicas handle reads
SELECT * FROM orders; -- Goes to replica
INSERT INTO orders ...; -- Goes to master
Sharding:
Application-level sharding:
def get_shard(user_id):
shard_id = user_id % 4
return shard_connections[shard_id]
Database-level sharding (CockroachDB):
-- Automatic sharding
SHOW RANGES FROM TABLE users;
Connection Pooling:
# SQLAlchemy connection pool
engine = create_engine(
'postgresql://user:pass@localhost/db',
pool_size=20,
max_overflow=10,
pool_pre_ping=True
)
Principles:
- Version control for schema
- Automated migrations
- Testing in CI pipeline
- Rollback capability
- Zero-downtime deployments
Migration Tools:
- Flyway
- Liquibase
- Alembic (Python)
- ActiveRecord Migrations (Rails)
Flyway Example:
-- V1__create_users_table.sql
CREATE TABLE users (
id INT PRIMARY KEY AUTO_INCREMENT,
email VARCHAR(200) NOT NULL UNIQUE,
name VARCHAR(200) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- V2__add_password_hash.sql
ALTER TABLE users ADD COLUMN password_hash CHAR(60) NOT NULL;
-- V3__create_orders_table.sql
CREATE TABLE orders (
id INT PRIMARY KEY AUTO_INCREMENT,
user_id INT NOT NULL,
amount DECIMAL(10,2) NOT NULL,
status VARCHAR(50) DEFAULT 'pending',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (user_id) REFERENCES users(id)
);
Migration Commands:
# Apply migrations
flyway migrate
# Check status
flyway info
# Validate checksums
flyway validate
# Repair failed migration
flyway repair
Automated Backup Script:
#!/bin/bash
# daily_backup.sh
BACKUP_DIR="/backup/$(date +%Y-%m-%d)"
mkdir -p $BACKUP_DIR
# Full database backup (custom format so pg_restore can verify it)
pg_dump -h localhost -U dbadmin -Fc mydb > $BACKUP_DIR/full.dump
# Archive and compress
tar -czf $BACKUP_DIR.tar.gz $BACKUP_DIR
# Upload to cloud storage
aws s3 cp $BACKUP_DIR.tar.gz s3://my-backups/
# Clean old backups (keep 30 days)
find /backup -type d -mtime +30 -exec rm -rf {} \;
# Test backup integrity
pg_restore --list $BACKUP_DIR/full.dump > /dev/null
if [ $? -eq 0 ]; then
echo "Backup valid" | mail -s "Backup Success" dba@example.com
else
echo "Backup corrupted" | mail -s "BACKUP FAILED" dba@example.com
fi
Prometheus + Grafana Setup:
# prometheus.yml
scrape_configs:
- job_name: 'postgresql'
static_configs:
- targets: ['localhost:9187'] # PostgreSQL exporter
- job_name: 'mysql'
static_configs:
- targets: ['localhost:9104'] # MySQL exporter
Key Metrics:
- Query latency (p95, p99)
- Connection count
- Cache hit ratio
- Replication lag
- Disk I/O
- CPU/Memory usage
Alerting Rules:
groups:
- name: database_alerts
rules:
- alert: HighQueryLatency
expr: pg_stat_activity_max_tx_duration > 300
for: 5m
annotations:
summary: "High query latency detected"
- alert: LowCacheHitRatio
expr: (pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read)) < 0.95
for: 10m
annotations:
summary: "Cache hit ratio below 95%"
Requirements:
- 100% ACID compliance
- Zero data loss
- High availability (99.999%)
- Regulatory compliance
- Fraud detection
- Real-time balances
Architecture:
Frontend → Load Balancer → Application Servers → Database Cluster
↓ ↓
Cache Layer Read Replicas
(Redis) (Reporting)
Database Design:
-- Account table with optimistic locking
CREATE TABLE accounts (
account_id BIGINT PRIMARY KEY,
account_number VARCHAR(20) UNIQUE NOT NULL,
customer_id BIGINT NOT NULL,
account_type VARCHAR(20) NOT NULL,
balance DECIMAL(15,2) NOT NULL DEFAULT 0.00,
currency CHAR(3) NOT NULL DEFAULT 'USD',
status VARCHAR(20) NOT NULL DEFAULT 'active',
version INT NOT NULL DEFAULT 0, -- For optimistic locking
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Transaction table (append-only)
-- MySQL notes: the partitioning column must appear in every unique key
-- (hence the composite keys below), and YEAR() partitioning requires a
-- DATE/DATETIME column, so created_at is DATETIME rather than TIMESTAMP
CREATE TABLE transactions (
transaction_id BIGINT NOT NULL AUTO_INCREMENT,
account_id BIGINT NOT NULL REFERENCES accounts(account_id),
transaction_type VARCHAR(20) NOT NULL,
amount DECIMAL(15,2) NOT NULL,
balance_after DECIMAL(15,2) NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
reference_number VARCHAR(50),
description VARCHAR(500),
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (transaction_id, created_at),
UNIQUE KEY uq_reference (reference_number, created_at),
INDEX idx_account_created (account_id, created_at DESC)
) PARTITION BY RANGE (YEAR(created_at)) (
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION p2025 VALUES LESS THAN (2026),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
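One benefit of the append-only design with a `balance_after` column is that an account's history can be audited by replay: each row's `balance_after` must equal the previous one plus the row's `amount`. A sketch of such a consistency check (the function and the sample rows are illustrative, not part of the schema):

```python
def audit_ledger(rows, opening_balance=0):
    """rows: (amount, balance_after) pairs in chronological order.
    True iff every balance_after matches the running sum."""
    balance = opening_balance
    for amount, balance_after in rows:
        balance += amount
        if balance != balance_after:
            return False
    return True

# Consistent history: +100 deposit, then -30 transfer out
ok = audit_ledger([(100, 100), (-30, 70)])
# A tampered balance_after is detected
bad = audit_ledger([(100, 100), (-30, 75)])
```

Runs like this are one way regulators and internal fraud-detection jobs exploit an immutable transaction log.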
-- Transfer funds stored procedure with locking
DELIMITER //
CREATE PROCEDURE TransferFunds(
IN from_account BIGINT,
IN to_account BIGINT,
IN transfer_amount DECIMAL(15,2),
IN reference VARCHAR(50)
)
BEGIN
DECLARE EXIT HANDLER FOR SQLEXCEPTION
BEGIN
ROLLBACK;
RESIGNAL;
END;
START TRANSACTION;
-- Lock accounts in consistent order to prevent deadlocks
IF from_account < to_account THEN
SELECT balance INTO @balance FROM accounts
WHERE account_id = from_account FOR UPDATE;
SELECT balance INTO @balance FROM accounts
WHERE account_id = to_account FOR UPDATE;
ELSE
SELECT balance INTO @balance FROM accounts
WHERE account_id = to_account FOR UPDATE;
SELECT balance INTO @balance FROM accounts
WHERE account_id = from_account FOR UPDATE;
END IF;
-- Check sufficient funds
SELECT balance INTO @from_balance FROM accounts
WHERE account_id = from_account;
IF @from_balance < transfer_amount THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Insufficient funds';
END IF;
-- Update balances
UPDATE accounts
SET balance = balance - transfer_amount,
version = version + 1
WHERE account_id = from_account;
UPDATE accounts
SET balance = balance + transfer_amount,
version = version + 1
WHERE account_id = to_account;
-- Record transactions
-- (reference_number is unique, so each leg gets its own suffix)
INSERT INTO transactions (
account_id, transaction_type, amount,
balance_after, reference_number, description
) VALUES (
from_account, 'TRANSFER_OUT', -transfer_amount,
(SELECT balance FROM accounts WHERE account_id = from_account),
CONCAT(reference, '-OUT'), CONCAT('Transfer to ', to_account)
);
INSERT INTO transactions (
account_id, transaction_type, amount,
balance_after, reference_number, description
) VALUES (
to_account, 'TRANSFER_IN', transfer_amount,
(SELECT balance FROM accounts WHERE account_id = to_account),
CONCAT(reference, '-IN'), CONCAT('Transfer from ', from_account)
);
COMMIT;
END//
DELIMITER ;

High Availability Setup:
-- Synchronous replication
-- Primary and standby with automatic failover
-- Primary configuration (postgresql.conf)
synchronous_commit = on
synchronous_standby_names = 'standby1'
-- Standby configuration (recovery.conf; from PostgreSQL 12 onward these
-- settings live in postgresql.conf alongside an empty standby.signal file)
primary_conninfo = 'host=primary port=5432 user=replicator'
recovery_target_timeline = 'latest'

Requirements:
- 1B+ users
- 100M+ daily active users
- Real-time feeds
- Graph relationships
- High write throughput
- Eventually consistent reads
Database Mix:
- User data: PostgreSQL (sharded)
- Posts: Cassandra (wide rows)
- Relationships: Neo4j (graph)
- Feeds: Redis (timelines)
- Activity logs: HBase (time-series)
Post Storage (Cassandra):
CREATE TABLE posts (
user_id UUID,
post_id TIMEUUID,
content TEXT,
media_urls LIST<TEXT>,
likes_count INT,
comments_count INT,
shares_count INT,
created_at TIMESTAMP,
PRIMARY KEY (user_id, created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);
-- Get user's posts
SELECT * FROM posts
WHERE user_id = ?
LIMIT 20;
-- Time-based partitioning
CREATE TABLE posts_by_time (
bucket INT, -- e.g., YYYYMMDD
post_id TIMEUUID,
user_id UUID,
content TEXT,
PRIMARY KEY (bucket, post_id)
);

Feed Generation:
def generate_feed(user_id, limit=50):
    # Get followed users from graph DB
    followed = graph.run("""
        MATCH (u:User {id: $user_id})-[:FOLLOWS]->(f:User)
        RETURN f.id
    """, user_id=user_id).values()
    # Get recent posts from Cassandra
    posts = []
    for followed_id in followed:
        recent = session.execute("""
            SELECT * FROM posts
            WHERE user_id = %s
            ORDER BY created_at DESC
            LIMIT 10
        """, [followed_id])
        posts.extend(recent)
    # Sort by time and cache in Redis
    posts.sort(key=lambda x: x.created_at, reverse=True)
    redis_client.setex(
        f"feed:{user_id}",
        300,  # 5 minute cache
        # Driver Row objects are not JSON-serializable; convert to dicts first
        json.dumps([dict(row._asdict()) for row in posts[:limit]], default=str)
    )
    return posts[:limit]

Requirements:
- Product catalog
- Shopping cart
- Order processing
- Inventory management
- Recommendation engine
- Seasonal spikes (Black Friday)
Key Challenges:
- Handle traffic spikes
- Prevent overselling
- Real-time inventory
- Personalized recommendations
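The "prevent overselling" challenge is commonly handled with optimistic locking: read the row along with its version, apply the update only if the version is unchanged, and retry on conflict. A single-threaded Python sketch of the compare-and-swap loop (the in-memory `store` stands in for the inventory table; in a real database the version check and the update must be a single atomic statement):

```python
class VersionConflict(Exception):
    pass

def reserve(store, product_id, qty, max_retries=3):
    """Optimistically reserve qty units of a product; retry on version conflict."""
    for _ in range(max_retries):
        snapshot = dict(store[product_id])  # read the row and remember its version
        if snapshot["quantity"] - snapshot["reserved"] < qty:
            raise ValueError("Insufficient inventory")
        current = store[product_id]
        # Compare-and-swap: apply only if no other writer bumped the version
        # (in SQL this is the UPDATE ... WHERE version = v_version step)
        if current["version"] == snapshot["version"]:
            current["reserved"] += qty
            current["version"] += 1
            return current["version"]
    raise VersionConflict(f"could not reserve product {product_id}")

store = {42: {"quantity": 10, "reserved": 0, "version": 0}}
reserve(store, 42, 3)
```

Bounding the retries matters during traffic spikes: under heavy contention it is better to fail fast and let the shopper retry than to spin indefinitely.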
Inventory Management:
-- Inventory table with optimistic locking
CREATE TABLE inventory (
product_id INT PRIMARY KEY,
sku VARCHAR(50) UNIQUE NOT NULL,
quantity INT NOT NULL CHECK (quantity >= 0),
reserved INT NOT NULL DEFAULT 0,
reorder_point INT,
version INT NOT NULL DEFAULT 0,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Reserve inventory with optimistic locking
CREATE PROCEDURE ReserveInventory(
IN p_product_id INT,
IN p_quantity INT,
IN p_order_id INT
)
BEGIN
DECLARE v_available INT;
DECLARE v_version INT;
DECLARE EXIT HANDLER FOR SQLEXCEPTION
BEGIN
ROLLBACK;
RESIGNAL;
END;
START TRANSACTION;
-- Check availability
SELECT quantity - reserved, version
INTO v_available, v_version
FROM inventory
WHERE product_id = p_product_id
FOR UPDATE;
IF v_available < p_quantity THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Insufficient inventory';
END IF;
-- Update with version check
UPDATE inventory
SET reserved = reserved + p_quantity,
version = version + 1
WHERE product_id = p_product_id
AND version = v_version;
-- Record reservation
INSERT INTO inventory_reservations (
order_id, product_id, quantity, reserved_at
) VALUES (
p_order_id, p_product_id, p_quantity, NOW()
);
COMMIT;
END;

Order Processing with Saga Pattern:
def create_order(user_id, items):
    order_id = None
    payment_id = None
    try:
        # Step 1: Create order record
        order_id = create_order_record(user_id, items)
        # Step 2: Reserve inventory for each item
        for item in items:
            reserve_inventory(item.product_id, item.quantity, order_id)
        # Step 3: Process payment
        payment_id = process_payment(user_id, calculate_total(items))
        # Step 4: Update order status
        confirm_order(order_id, payment_id)
        # Step 5: Trigger fulfillment
        queue_fulfillment(order_id)
        return order_id
    except Exception:
        # Compensation: roll back the steps that completed
        if order_id is not None:
            cancel_order(order_id)
        if payment_id is not None:
            refund_payment(payment_id)
        raise

Requirements:
- Microsecond latency
- 100% data accuracy
- No downtime
- Complex risk checks
- Regulatory reporting
- Audit trails
Architecture:
Market Data → Order Gateway → Risk Engine → Order Manager → Exchange
                   ↑                             ↓
               Client API                  Trade Database
Database Design:
-- In-memory tables for speed (MEMORY tables are emptied on restart;
-- positions can be rebuilt from the persistent trades table)
CREATE TABLE positions (
account_id INT,
symbol VARCHAR(10),
quantity INT,
avg_price DECIMAL(10,4),
last_updated TIMESTAMP,
PRIMARY KEY (account_id, symbol)
) ENGINE=MEMORY;
-- Persistent storage for compliance
-- (trade_time is DATETIME(6) because MySQL cannot use TO_DAYS() on
-- TIMESTAMP columns, and the partition column must be in the primary key)
CREATE TABLE trades (
trade_id BIGINT NOT NULL,
order_id BIGINT NOT NULL,
account_id INT NOT NULL,
symbol VARCHAR(10) NOT NULL,
side ENUM('BUY', 'SELL') NOT NULL,
quantity INT NOT NULL,
price DECIMAL(10,4) NOT NULL,
trade_time DATETIME(6) NOT NULL, -- Microsecond precision
exchange VARCHAR(50) NOT NULL,
compliance_check BOOLEAN DEFAULT FALSE,
PRIMARY KEY (trade_id, trade_time),
INDEX idx_account_time (account_id, trade_time DESC)
) PARTITION BY RANGE (TO_DAYS(trade_time)) (
PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- Risk limits
CREATE TABLE risk_limits (
account_id INT PRIMARY KEY,
max_position_size INT,
max_daily_loss DECIMAL(10,2),
max_order_value DECIMAL(10,2),
allowed_symbols JSON,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Order Validation:
CREATE PROCEDURE ValidateOrder(
IN p_account_id INT,
IN p_symbol VARCHAR(10),
IN p_side VARCHAR(4),
IN p_quantity INT,
IN p_price DECIMAL(10,4)
)
BEGIN
DECLARE v_current_position INT;
DECLARE v_max_position INT;
DECLARE v_daily_loss DECIMAL(10,2);
DECLARE v_allowed JSON; -- allowed_symbols column holds a JSON array
-- Check risk limits (in microseconds)
SELECT max_position_size, max_daily_loss, allowed_symbols
INTO v_max_position, v_daily_loss, v_allowed
FROM risk_limits
WHERE account_id = p_account_id
FOR UPDATE; -- Lock to prevent concurrent overrides
-- Check symbol allowed
IF NOT JSON_CONTAINS(v_allowed, JSON_QUOTE(p_symbol)) THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Symbol not allowed';
END IF;
-- Check position limits
SELECT COALESCE(SUM(IF(side='BUY', quantity, -quantity)), 0)
INTO v_current_position
FROM trades
WHERE account_id = p_account_id
AND symbol = p_symbol
AND trade_time >= DATE_SUB(NOW(), INTERVAL 1 DAY);
IF ABS(v_current_position +
IF(p_side='BUY', p_quantity, -p_quantity)) > v_max_position THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Position limit exceeded';
END IF;
-- Log validation
INSERT INTO order_validation_log (
account_id, symbol, side, quantity, price,
validation_time, validation_result
) VALUES (
p_account_id, p_symbol, p_side, p_quantity, p_price,
NOW(6), 'PASSED'
);
END;

Proof of Equivalence: σ_cond1(σ_cond2(R)) = σ_cond2(σ_cond1(R))
Let t be a tuple in σ_cond1(σ_cond2(R))
⇔ t ∈ R and cond2(t) true and cond1(t) true
⇔ t ∈ R and cond1(t) true and cond2(t) true
⇔ t ∈ σ_cond2(σ_cond1(R))
Therefore, selection operations commute.
Proof of Push Selection through Join: σ_cond(R ⋈ S) = σ_cond(R) ⋈ S, if cond references only R
Let t be a tuple in σ_cond(R ⋈ S)
⇔ ∃ r ∈ R, s ∈ S such that t = r ∪ s and cond(r) true (cond references only attributes of R, so cond(t) = cond(r))
⇔ ∃ r ∈ σ_cond(R), s ∈ S such that t = r ∪ s
⇔ t ∈ σ_cond(R) ⋈ S
Therefore, selection pushes through join when condition references only one relation.
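Both identities can be sanity-checked mechanically on small relations. Here selection and natural join are implemented over lists of dicts (the sample relations and conditions are made up for illustration):

```python
def select(cond, rel):
    # σ_cond(rel): keep tuples satisfying the predicate
    return [t for t in rel if cond(t)]

def join(r, s):
    # Natural join: combine tuples that agree on all shared attribute names
    out = []
    for a in r:
        for b in s:
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                out.append({**a, **b})
    return out

R = [{"id": 1, "x": 5}, {"id": 2, "x": 9}]
S = [{"id": 1, "y": "a"}, {"id": 2, "y": "b"}]
c1 = lambda t: t["x"] > 4       # references only R's attributes
c2 = lambda t: t["id"] == 2

# Selections commute
assert select(c1, select(c2, R)) == select(c2, select(c1, R))
# Selection on R-only attributes pushes through the join
assert select(c1, join(R, S)) == join(select(c1, R), S)
```

The second assertion is the optimization a query planner exploits: filtering R before the join shrinks the join's input without changing the result.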
Common SQL Functions:
| Function | Description | Example |
|---|---|---|
| COUNT(*) | Count rows | SELECT COUNT(*) FROM users |
| SUM(col) | Sum values | SELECT SUM(amount) FROM orders |
| AVG(col) | Average | SELECT AVG(price) FROM products |
| MIN(col) | Minimum | SELECT MIN(salary) FROM employees |
| MAX(col) | Maximum | SELECT MAX(age) FROM users |
| CONCAT | String join | SELECT CONCAT(first, ' ', last) |
| SUBSTRING | Extract part | SELECT SUBSTRING(email, 1, 5) |
| COALESCE | First non-null | SELECT COALESCE(phone, email) |
| DATE_ADD | Add to date | SELECT DATE_ADD(date, INTERVAL 1 DAY) |
| DATEDIFF | Date difference | SELECT DATEDIFF(date1, date2) |
Common SQL Patterns:
-- Pagination
SELECT * FROM users
ORDER BY id
LIMIT 10 OFFSET 20;
-- Insert or update
INSERT INTO users (id, name) VALUES (1, 'Alice')
ON DUPLICATE KEY UPDATE name = VALUES(name);
-- Find duplicates
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
-- Running total
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales;
-- Top N per group
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) ranked
WHERE rn <= 3;

Project 1: Library Management System
CREATE DATABASE library;
USE library;
CREATE TABLE books (
book_id INT PRIMARY KEY AUTO_INCREMENT,
isbn VARCHAR(20) UNIQUE NOT NULL,
title VARCHAR(500) NOT NULL,
author VARCHAR(200) NOT NULL,
publisher VARCHAR(200),
publication_year INT,
category VARCHAR(100),
total_copies INT NOT NULL DEFAULT 1,
available_copies INT NOT NULL DEFAULT 1,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE members (
member_id INT PRIMARY KEY AUTO_INCREMENT,
card_number VARCHAR(20) UNIQUE NOT NULL,
first_name VARCHAR(100) NOT NULL,
last_name VARCHAR(100) NOT NULL,
email VARCHAR(200) UNIQUE NOT NULL,
phone VARCHAR(20),
address TEXT,
membership_date DATE NOT NULL,
membership_expiry DATE NOT NULL,
status VARCHAR(20) DEFAULT 'active'
);
CREATE TABLE loans (
loan_id INT PRIMARY KEY AUTO_INCREMENT,
book_id INT NOT NULL REFERENCES books(book_id),
member_id INT NOT NULL REFERENCES members(member_id),
loan_date DATE NOT NULL,
due_date DATE NOT NULL,
return_date DATE,
status VARCHAR(20) DEFAULT 'borrowed',
fine_amount DECIMAL(5,2) DEFAULT 0.00,
INDEX idx_dates (due_date, status)
);
CREATE TABLE fines (
fine_id INT PRIMARY KEY AUTO_INCREMENT,
loan_id INT NOT NULL REFERENCES loans(loan_id),
amount DECIMAL(5,2) NOT NULL,
paid BOOLEAN DEFAULT FALSE,
paid_date DATE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Stored procedure for borrowing
DELIMITER //
CREATE PROCEDURE BorrowBook(
IN p_book_id INT,
IN p_member_id INT
)
BEGIN
DECLARE v_available INT;
DECLARE v_member_status VARCHAR(20);
START TRANSACTION;
-- Check member status
SELECT status INTO v_member_status
FROM members WHERE member_id = p_member_id;
IF v_member_status != 'active' THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'Member not active';
END IF;
-- Check availability
SELECT available_copies INTO v_available
FROM books WHERE book_id = p_book_id
FOR UPDATE;
IF v_available < 1 THEN
SIGNAL SQLSTATE '45000'
SET MESSAGE_TEXT = 'No copies available';
END IF;
-- Update book count
UPDATE books
SET available_copies = available_copies - 1
WHERE book_id = p_book_id;
-- Create loan
INSERT INTO loans (book_id, member_id, loan_date, due_date)
VALUES (p_book_id, p_member_id, CURDATE(), DATE_ADD(CURDATE(), INTERVAL 14 DAY));
COMMIT;
END//
DELIMITER ;

Basic Level:
- What is the difference between DELETE and TRUNCATE?
- Explain primary key vs unique key.
- What are the ACID properties?
- Write a query to find second highest salary.
- Explain different types of JOINs.
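For the second-highest-salary question, one standard SQL answer is `SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees)`; the same logic sketched in Python over a plain list (handle duplicates and the single-salary edge case):

```python
def second_highest(salaries):
    """Second-highest distinct salary, or None if no second value exists."""
    distinct = sorted(set(salaries), reverse=True)
    return distinct[1] if len(distinct) > 1 else None

second_highest([50, 90, 90, 70])  # duplicates of the top salary don't count
second_highest([90])              # no second value
```

Interviewers often probe exactly these edge cases: duplicated maxima (hence `set`) and tables with fewer than two distinct salaries (hence `None`, the analogue of SQL's NULL result).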
Intermediate Level:
- What is normalization? Explain 1NF, 2NF, 3NF.
- How do indexes work? When should you use them?
- Explain the difference between clustered and non-clustered indexes.
- What is a deadlock? How can you prevent it?
- Write a query to find employees earning more than their managers.
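The employees-vs-managers question is a classic self-join (join `employees` to itself on `manager_id` and compare salaries); the equivalent logic sketched in Python (the `staff` sample data is illustrative):

```python
def earns_more_than_manager(employees):
    """employees: dicts with id, name, salary, manager_id (None for the top)."""
    by_id = {e["id"]: e for e in employees}  # index for the self-join lookup
    return [
        e["name"]
        for e in employees
        if e["manager_id"] is not None
        and e["salary"] > by_id[e["manager_id"]]["salary"]
    ]

staff = [
    {"id": 1, "name": "Ava", "salary": 120, "manager_id": None},
    {"id": 2, "name": "Bo", "salary": 130, "manager_id": 1},
    {"id": 3, "name": "Cy", "salary": 90, "manager_id": 1},
]
earns_more_than_manager(staff)
```

The dictionary lookup plays the role of the join condition `e.manager_id = m.id`; employees with no manager drop out, just as they would in an inner self-join.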
Advanced Level:
- Explain MVCC and how it enables concurrency.
- How would you design a database for a social media platform?
- What's the difference between vertical and horizontal partitioning?
- Explain the CAP theorem and its implications for distributed databases.
- How would you handle a database with billions of rows?
Practical Scenarios:
- Your database is slow. How do you diagnose and fix it?
- How do you handle a schema migration with zero downtime?
- Design a system to handle Black Friday traffic spikes.
- How do you ensure GDPR compliance in your database?
- Your database crashed at 2 AM. What do you do?
Classic Papers:
- Codd, E.F. (1970). "A Relational Model of Data for Large Shared Data Banks"
- Gray, J. et al. (1981). "The Recovery Manager of the System R Database Manager"
- Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System"
- Stonebraker, M. et al. (1976). "The Design and Implementation of INGRES"
Modern Papers:
- DeCandia, G. et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store"
- Chang, F. et al. (2006). "Bigtable: A Distributed Storage System for Structured Data"
- Corbett, J.C. et al. (2012). "Spanner: Google's Globally-Distributed Database"
- Lakshman, A. & Malik, P. (2010). "Cassandra: A Decentralized Structured Storage System"
Books:
- "Database System Concepts" - Silberschatz, Korth, Sudarshan
- "Designing Data-Intensive Applications" - Martin Kleppmann
- "SQL Performance Explained" - Markus Winand
- "Database Internals" - Alex Petrov
- "Readings in Database Systems" (The Red Book) - Peter Bailis et al.
Online Resources:
- Use The Index, Luke! (use-the-index-luke.com)
- High Scalability (highscalability.com)
- PostgreSQL Documentation
- MySQL Performance Blog
- DB-Engines Ranking (db-engines.com)