Software Engineering Interview Comprehensive Cheatsheet

Interview 1 - Pair Programming Analysis

Thinking Aloud Techniques

  • Verbalize your thought process: "I'm thinking we should check for edge cases here..."
  • Explain trade-offs: "We could use a hash map for O(1) lookup, but it would cost O(n) space..."
  • Articulate assumptions: "I'm assuming the input array is not sorted..."
  • Narrate your debugging process: "Let's trace through with this example input..."

Code Optimization Patterns

Time/Space Complexity Improvements

# Before: O(n²) time complexity
def find_duplicate(arr):
    for i in range(len(arr)):
        for j in range(i+1, len(arr)):
            if arr[i] == arr[j]:
                return arr[i]
    return -1

# After: O(n) time complexity, O(n) space complexity
def find_duplicate_optimized(arr):
    seen = set()
    for num in arr:
        if num in seen:
            return num
        seen.add(num)
    return -1

# Further optimization: Floyd's Tortoise and Hare (for specific cases)
# Only applies when arr holds n+1 integers whose values all lie in 1..n,
# so every value is a valid index; runs in O(n) time with O(1) extra space.
def find_duplicate_cycle(arr):
    slow = fast = arr[0]
    while True:
        slow = arr[slow]
        fast = arr[arr[fast]]
        if slow == fast:
            break
            
    slow = arr[0]
    while slow != fast:
        slow = arr[slow]
        fast = arr[fast]
    return slow

Code Readability & Maintainability

# Before
def p(d, s):
    r = []
    for i in range(len(d)):
        if d[i] > s:
            r.append(i)
    return r

# After
def find_elements_greater_than_threshold(data, threshold):
    """Returns indices of elements that exceed the given threshold."""
    result_indices = []
    for index, value in enumerate(data):
        if value > threshold:
            result_indices.append(index)
    return result_indices

Error Handling & Edge Cases

# Poor error handling
def divide(a, b):
    return a / b  # Will crash if b = 0

# Better error handling
def divide_safely(a, b):
    if b == 0:
        return None  # Or raise a custom exception
    return a / b

# Comprehensive error handling
import logging

def divide_with_validation(a, b):
    try:
        if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
            raise TypeError("Both inputs must be numbers")
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b
    except Exception as e:
        logging.error(f"Error in division: {e}")
        # Handle appropriately or re-raise
        raise

Testing Strategies

# Unit testing example
def test_find_duplicate():
    # Test normal case
    assert find_duplicate([1, 2, 3, 2, 4]) == 2
    
    # Test edge cases
    assert find_duplicate([]) == -1
    assert find_duplicate([1]) == -1
    
    # Test with no duplicates
    assert find_duplicate([1, 2, 3, 4]) == -1
    
    # Performance test with large input
    large_input = list(range(10000)) + [5]
    assert find_duplicate(large_input) == 5

Improving Existing Codebases

  • Examples to mention:
    • "I refactored a monolithic function into smaller, testable units reducing bug rate by 40%"
    • "I introduced caching to a frequently called API endpoint, reducing response time by 300ms"
    • "I standardized error handling across the service, improving debuggability"
    • "I added comprehensive logging, enabling us to identify a subtle race condition"

Interview 2 - Architecture Design

Scalability Patterns

Horizontal vs Vertical Scaling

  • Vertical Scaling: Upgrading existing machines (more CPU, RAM)
    • ✅ Simple to implement
    • ❌ Hardware limits, single point of failure
  • Horizontal Scaling: Adding more machines
    • ✅ Theoretically unlimited scaling, fault tolerance
    • ❌ More complex architecture, data consistency challenges

Load Balancing

Client → [ Load Balancer ] → Server 1
                           → Server 2
                           → Server 3
  • Strategies: Round robin, least connections, IP hash
  • Sticky Sessions: Route user to same server for session duration
  • Health Checks: Automatically remove unhealthy servers
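
A minimal selection sketch for these strategies (round robin, least connections, IP hash/sticky). The server names and the pick/release interface are purely illustrative, not any real load balancer's API:

# Toy load-balancing strategies (illustrative only)
import hashlib
import itertools

class RoundRobinBalancer:
    """Cycle through servers in order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Pick the server currently handling the fewest active connections."""
    def __init__(self, servers):
        self._active = {s: 0 for s in servers}

    def pick(self):
        server = min(self._active, key=self._active.get)
        self._active[server] += 1
        return server

    def release(self, server):
        self._active[server] -= 1

def ip_hash_pick(servers, client_ip):
    """IP hash / sticky routing: the same client IP always maps to the same server."""
    digest = int(hashlib.sha1(client_ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["server-1", "server-2", "server-3"]
rr = RoundRobinBalancer(servers)
print([rr.pick() for _ in range(4)])          # server-1, server-2, server-3, server-1
print(ip_hash_pick(servers, "203.0.113.7"))   # deterministic per client IP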

Distributed Systems Concepts

CAP Theorem

  • Consistency: All nodes see the same data at the same time
  • Availability: Every request receives a response
  • Partition Tolerance: System continues to operate despite network failures
  • You can only guarantee two of the three:
    • CA: Traditional RDBMS (PostgreSQL, MySQL)
    • CP: MongoDB, HBase, Redis
    • AP: Cassandra, DynamoDB, CouchDB

Consistency Models

  • Strong Consistency: All reads reflect all prior writes
  • Eventual Consistency: Given enough time, all replicas will converge
  • Causal Consistency: Operations causally related must be seen in same order
  • Session Consistency: Client sees its own updates
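
A toy sketch (not a real replication protocol) contrasting strong vs eventual consistency: reads against the primary always see the latest write, while a replica that applies writes after a lag serves stale data until it converges:

import time

class Replica:
    """Toy key-value replica that applies replicated writes after a lag."""
    def __init__(self):
        self.data = {}
        self.pending = []  # list of (apply_at, key, value)

    def replicate(self, key, value, lag_seconds):
        self.pending.append((time.time() + lag_seconds, key, value))

    def read(self, key):
        now = time.time()
        still_pending = []
        for apply_at, k, v in self.pending:
            if apply_at <= now:
                self.data[k] = v          # replication finally applied
            else:
                still_pending.append((apply_at, k, v))
        self.pending = still_pending
        return self.data.get(key)

primary = {}
replica = Replica()

# A write hits the primary immediately; replication lags by one second.
primary["balance"] = 100
replica.replicate("balance", 100, lag_seconds=1.0)

print(primary["balance"])       # 100  -> strong consistency if every read hits the primary
print(replica.read("balance"))  # None -> stale read: eventual consistency
time.sleep(1.1)
print(replica.read("balance"))  # 100  -> replicas have converged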

Failure Handling

  • Circuit Breaker Pattern:
// Pseudo-code
CircuitBreaker breaker = new CircuitBreaker(threshold=5, timeout=60s);

public Response callService() {
    if (breaker.isOpen()) {
        return fallbackResponse();
    }
    
    try {
        Response response = service.call();
        breaker.recordSuccess();
        return response;
    } catch (Exception e) {
        breaker.recordFailure();
        return fallbackResponse();
    }
}
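
The same pattern as a minimal, runnable Python sketch; the threshold, timeout, and the call_service/fallback callables are illustrative:

import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call after `timeout` seconds."""
    def __init__(self, threshold=5, timeout=60.0):
        self.threshold = threshold
        self.timeout = timeout
        self.failure_count = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at >= self.timeout:
            return False  # half-open: let one trial call through
        return True

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.threshold:
            self.opened_at = time.time()

def call_with_breaker(breaker, call_service, fallback):
    if breaker.is_open():
        return fallback()          # fail fast, don't hammer a struggling dependency
    try:
        response = call_service()
        breaker.record_success()
        return response
    except Exception:
        breaker.record_failure()
        return fallback()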

Microservices vs Monoliths

Monolith Architecture

┌───────────────────────────────┐
│            Monolith           │
│ ┌─────────┐ ┌─────────────┐   │
│ │   UI    │ │  Business   │   │
│ │         │ │   Logic     │   │
│ └─────────┘ └─────────────┘   │
│ ┌──────────────────────────┐  │
│ │         Database         │  │
│ └──────────────────────────┘  │
└───────────────────────────────┘

Microservices Architecture

┌───────────┐  ┌───────────┐  ┌───────────┐
│ Service A │  │ Service B │  │ Service C │
│ ┌─────────┐│  │┌─────────┐│  │┌─────────┐│
│ │ Business││  ││ Business││  ││ Business││
│ │  Logic  ││  ││  Logic  ││  ││  Logic  ││
│ └─────────┘│  │└─────────┘│  │└─────────┘│
│ ┌─────────┐│  │┌─────────┐│  │┌─────────┐│
│ │   DB    ││  ││   DB    ││  ││   DB    ││
│ └─────────┘│  │└─────────┘│  │└─────────┘│
└───────────┘  └───────────┘  └───────────┘

Trade-offs

  • Monoliths
    • ✅ Simpler development, deployment, debugging
    • ✅ Better performance (no network calls)
    • ❌ Scaling challenges, technology lock-in
  • Microservices
    • ✅ Independent scaling and deployment
    • ✅ Technology diversity, team autonomy
    • ❌ Distributed system complexity
    • ❌ Service boundary design challenges

Data Storage Solutions

SQL vs NoSQL

SQL (Relational)               NoSQL
Structured data                Varied data structures
ACID transactions              BASE (Basically Available, Soft state, Eventually consistent)
Rigid schema                   Flexible schema
Vertical scaling               Horizontal scaling
Examples: PostgreSQL, MySQL    Examples: MongoDB, Cassandra, Redis

Caching Strategies

# Cache-Aside Pattern
def get_user(user_id):
    # Try to get from cache
    user = cache.get(f"user:{user_id}")
    if user is not None:
        return user
        
    # Cache miss, get from database (parameterized query avoids SQL injection)
    user = db.query("SELECT * FROM users WHERE id = %s", (user_id,))
    
    # Update cache
    cache.set(f"user:{user_id}", user, expiry=3600)
    return user
  • Cache Invalidation Strategies:
    • TTL (Time-To-Live)
    • Write-through (update cache when writing to DB)
    • Write-behind (async update to DB after cache write)
    • Write-around (write to DB, let cache expire naturally)
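
A minimal sketch of write-through and write-around to contrast with the cache-aside read path above. The db/cache objects are the same hypothetical placeholders as in the previous snippet; cache.delete is an assumed eviction method:

CACHE_TTL_SECONDS = 3600

def update_user_email_write_through(user_id, new_email):
    # Write-through: update the database, then refresh the cache in the same
    # code path so readers never see a stale entry for this key.
    db.query("UPDATE users SET email = %s WHERE id = %s", (new_email, user_id))
    user = db.query("SELECT * FROM users WHERE id = %s", (user_id,))
    cache.set(f"user:{user_id}", user, expiry=CACHE_TTL_SECONDS)
    return user

def update_user_email_write_around(user_id, new_email):
    # Write-around: update only the database and evict the cached entry,
    # letting the next read (get_user above) repopulate the cache.
    db.query("UPDATE users SET email = %s WHERE id = %s", (new_email, user_id))
    cache.delete(f"user:{user_id}")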

Message Queues & Event-Driven Architectures

Message Queue Pattern

Producer → [Queue] → Consumer
  • Benefits:
    • Decoupling producers from consumers
    • Buffer for traffic spikes
    • Ensuring message delivery
    • Enabling parallel processing
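
An in-process sketch of the pattern using Python's standard queue module. A real system would use a broker such as Kafka, RabbitMQ, or SQS; the bounded queue here just illustrates decoupling, buffering, and parallel consumers:

import queue
import threading

task_queue = queue.Queue(maxsize=100)     # bounded queue buffers traffic spikes

def producer():
    for i in range(10):
        task_queue.put({"order_id": i})   # blocks when the buffer is full (back-pressure)

def consumer(worker_id):
    while True:
        message = task_queue.get()
        if message is None:               # sentinel: shut this worker down
            break
        print(f"worker {worker_id} processed order {message['order_id']}")
        task_queue.task_done()

workers = [threading.Thread(target=consumer, args=(i,)) for i in range(3)]
for w in workers:
    w.start()

producer()
task_queue.join()                         # wait until every message is processed
for _ in workers:
    task_queue.put(None)                  # one sentinel per worker
for w in workers:
    w.join()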

Event-Driven Architecture

               ┌─────────────┐
               │Event Source │
               └──────┬──────┘
                      │
               ┌──────▼──────┐
               │Event Channel│
               └──────┬──────┘
                      │
        ┌─────────────┼────────────┐
        │             │            │
┌───────▼─────┐ ┌─────▼───────┐ ┌─▼────────────┐
│Consumer A   │ │Consumer B   │ │Consumer C    │
└─────────────┘ └─────────────┘ └──────────────┘
  • Examples: Kafka, RabbitMQ, AWS SQS/SNS, Google Pub/Sub
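
A tiny in-memory sketch of the event-channel fan-out shown above (one event source, several independent consumers); the event names and payloads are made up for illustration:

from collections import defaultdict

class EventChannel:
    """Minimal publish/subscribe dispatcher."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)  # each consumer reacts independently

channel = EventChannel()
channel.subscribe("order.created", lambda e: print("billing:", e))
channel.subscribe("order.created", lambda e: print("shipping:", e))
channel.subscribe("order.created", lambda e: print("analytics:", e))

channel.publish("order.created", {"order_id": 42, "total": 99.0})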

Interview 3 - Technical Skills & Culture

Leading Complex Technical Projects

STAR Method Framework

  • Situation: "We needed to migrate our monolith to microservices"
  • Task: "I was tasked with leading the authentication service migration"
  • Action: "I created a phased migration plan, introduced feature flags for safe rollbacks, and implemented parallel running with traffic mirroring"
  • Result: "We completed the migration with zero downtime and improved auth response times by 40%"

Project Leadership Examples

  1. Technical Debt Reduction: "Reduced CI pipeline times from 45 to 12 minutes"
  2. Architecture Evolution: "Led transition from batch processing to streaming architecture"
  3. Performance Optimization: "Spearheaded database optimization reducing query times by 85%"

Mentoring Other Engineers

Mentoring Approaches

  • Pair Programming: Regular sessions focused on specific skills
  • Code Reviews: Thoughtful feedback with explanations and resources
  • Architecture Decision Records: Documenting decisions and reasoning
  • Tech Talks: Organizing knowledge sharing sessions

Success Stories

  • "Mentored junior engineer who became team lead in 18 months"
  • "Established onboarding process reducing time-to-productivity by 40%"
  • "Created internal documentation that became company standard"

Making Technical Decisions

Decision Framework

1. Clearly state the problem
2. List constraints (time, budget, people, tech)
3. Identify 2-3 viable options
4. Analyze trade-offs (performance, scalability, maintainability)
5. Make recommendation with rationale
6. Define success metrics

Example

  • Decision: "Move from self-hosted to managed Kubernetes"
  • Business Impact: "Reduced ops overhead by 30%, enabling team to focus on product features"
  • Technical Impact: "Improved reliability from 99.9% to 99.99% uptime"

Handling Technical Disagreements

Disagreement Resolution Process

1. Focus on shared goals
2. Base discussion on data when possible
3. Separate facts from opinions
4. Create pros/cons list for each approach
5. Consider short-term vs long-term implications
6. Agree on decision-making framework
7. Commit to decision once made

Example Scenario

  • "Team disagreed on GraphQL vs REST API approach"
  • "I organized a spike comparing performance, developer experience, and client needs"
  • "Data showed GraphQL improved mobile app performance significantly"
  • "We agreed on GraphQL with performance monitoring in place"

Cross-Team Collaboration

Collaboration Techniques

  • Interface Contracts: Clearly defined API contracts
  • RFC Process: Request For Comments documentation
  • Joint Planning: Cross-team planning sessions
  • Embedded Team Members: Temporary cross-team assignments

Example

  • "Partnered with data science team to implement ML feature"
  • "Created shared roadmap with clear responsibilities"
  • "Established weekly sync meetings focusing on integration points"
  • "Result: Successfully launched feature ahead of schedule"

Questions to Ask

Technical Challenges

  • "What's the most interesting technical challenge the team has solved recently?"
  • "How do you balance technical debt against new feature development?"
  • "What's your approach to keeping technology stack current?"

Decision Making

  • "How are architecture decisions made and documented?"
  • "How do you handle technical disagreements within the team?"
  • "Can you walk me through a recent significant technical decision?"

Team Growth

  • "How do you approach mentorship and career growth?"
  • "What skills are you looking to develop in the team?"
  • "How do you measure engineering effectiveness?"

Interview 4 - SQL Assessment

PostgreSQL-Specific Syntax

Data Types

-- PostgreSQL-specific data types
CREATE TABLE example (
    id SERIAL PRIMARY KEY,
    text_data TEXT,
    json_data JSONB,
    array_data INTEGER[],
    range_data INT4RANGE,
    enum_field my_enum_type,
    ip_address INET
);

JSON Operations

-- JSON creation and manipulation
SELECT 
    -- Create JSON object
    jsonb_build_object(
        'name', customer_name,
        'email', email,
        'orders', order_count
    ) AS customer_json,
    
    -- Extract value from JSON (multiple ways)
    customer_data->'address'->'city' AS city,
    customer_data->>'country' AS country, -- Extract as text
    jsonb_path_query(customer_data, '$.preferences[*] ? (@.type == "primary")') AS primary_prefs,
    
    -- Check for key existence
    customer_data ? 'premium' AS is_premium,
    
    -- Update JSON values
    jsonb_set(
        customer_data, 
        '{last_login}', 
        to_jsonb(CURRENT_TIMESTAMP)
    ) AS updated_data
FROM 
    customers;

Performance:

  • Use JSONB (not JSON) for better performance and indexing
  • GIN indexes can dramatically speed up JSON queries: CREATE INDEX idx_customer_data ON customers USING GIN (customer_data);
  • Use ->>'key' operator (returns text) when comparing values but -> (returns JSON) when traversing

Arrays

-- Array operations
SELECT 
    -- Array construction
    ARRAY[1, 2, 3] AS numeric_array,
    ARRAY['red', 'green', 'blue'] AS color_array,
    string_to_array('apple,orange,banana', ',') AS fruit_array,
    
    -- Array access (1-based indexing)
    tags[1] AS first_tag,
    
    -- Array functions
    array_length(tags, 1) AS num_tags,
    array_to_string(tags, ', ') AS tag_list,
    array_append(tags, 'new_tag') AS tags_with_new,
    array_remove(tags, 'old_tag') AS tags_without_old,
    
    -- Array containment
    ARRAY[1, 2] <@ available_sizes AS size_available,
    
    -- Unnest arrays into rows
    unnest(category_ids) AS category_id
FROM 
    products;

Performance:

  • Array operations in WHERE clause can use GIN indexes: CREATE INDEX idx_product_tags ON products USING GIN (tags);
  • Unnesting large arrays creates many rows and can be slow

Extensions

-- Enable common extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "postgis";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";

-- Use extension function
SELECT uuid_generate_v4();

-- Full-text search
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT 
    product_id,
    product_name,
    description,
    ts_rank(to_tsvector('english', description), query) AS rank
FROM 
    products,
    to_tsquery('english', 'comfortable & leather') AS query
WHERE 
    to_tsvector('english', description) @@ query
ORDER BY 
    rank DESC;

Performance:

  • Full-text search benefits greatly from specialized indexes:
CREATE INDEX idx_product_description_fts ON products 
USING GIN (to_tsvector('english', description));

Complex JOINs

INNER JOIN

-- Return only matching rows
SELECT 
    o.order_id, 
    c.customer_name,
    o.order_date
FROM 
    orders o
INNER JOIN 
    customers c ON o.customer_id = c.customer_id;

LEFT JOIN

-- Return all rows from left table, matching from right
SELECT 
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS order_count
FROM 
    customers c
LEFT JOIN 
    orders o ON c.customer_id = o.customer_id
GROUP BY 
    c.customer_id, 
    c.customer_name;

RIGHT JOIN

-- Return all rows from right table, matching from left
SELECT 
    p.product_id,
    p.product_name,
    o.order_id
FROM 
    order_items oi
RIGHT JOIN 
    products p ON oi.product_id = p.product_id
LEFT JOIN 
    orders o ON oi.order_id = o.order_id;

FULL JOIN

-- Return all rows from both tables
SELECT 
    e.employee_id,
    e.employee_name,
    d.department_name
FROM 
    employees e
FULL JOIN 
    departments d ON e.department_id = d.department_id;

Self JOIN

-- Join a table to itself
SELECT 
    e.employee_name AS employee,
    m.employee_name AS manager
FROM 
    employees e
LEFT JOIN 
    employees m ON e.manager_id = m.employee_id;

Window Functions

ROW_NUMBER, RANK, DENSE_RANK

SELECT 
    product_name,
    category,
    price,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS row_num,
    RANK() OVER (PARTITION BY category ORDER BY price DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY price DESC) AS dense_rank
FROM 
    products;

LAG/LEAD Functions

-- Compare current row with previous/next rows
SELECT 
    order_date,
    total_amount,
    LAG(total_amount) OVER (ORDER BY order_date) AS prev_amount,
    LEAD(total_amount) OVER (ORDER BY order_date) AS next_amount,
    total_amount - LAG(total_amount) OVER (ORDER BY order_date) AS amount_change
FROM 
    orders;

Running Totals

-- Calculate cumulative sum
SELECT 
    order_date,
    total_amount,
    SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
    SUM(total_amount) OVER (
        ORDER BY order_date 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS explicit_running_total
FROM 
    orders;

Aggregations

GROUP BY with HAVING

-- Filter grouped results
SELECT 
    customer_id,
    COUNT(*) AS order_count,
    SUM(total_amount) AS total_spent,
    AVG(total_amount) AS avg_order_value
FROM 
    orders
GROUP BY 
    customer_id
HAVING 
    COUNT(*) > 5 AND SUM(total_amount) > 1000;

Multiple Grouping Levels

-- Group by multiple columns
SELECT 
    EXTRACT(YEAR FROM order_date) AS year,
    EXTRACT(MONTH FROM order_date) AS month,
    product_category,
    COUNT(*) AS order_count,
    SUM(quantity) AS units_sold
FROM 
    orders o
JOIN 
    order_items oi ON o.order_id = oi.order_id
JOIN 
    products p ON oi.product_id = p.product_id
GROUP BY 
    EXTRACT(YEAR FROM order_date),
    EXTRACT(MONTH FROM order_date),
    product_category
ORDER BY 
    year, month, product_category;

GROUP BY ROLLUP

-- Produce subtotals and grand totals
SELECT 
    COALESCE(category, 'All Categories') AS category,
    COALESCE(sub_category, 'All Sub-Categories') AS sub_category,
    SUM(sales) AS total_sales
FROM 
    sales
GROUP BY 
    ROLLUP(category, sub_category);

Advanced Subqueries and CTEs

Subquery in SELECT

SELECT 
    department_name,
    employee_count,
    (SELECT AVG(employee_count) FROM department_stats) AS avg_dept_size,
    employee_count - (SELECT AVG(employee_count) FROM department_stats) AS diff_from_avg
FROM 
    department_stats;

Performance: A correlated subquery in SELECT runs once per output row; uncorrelated scalar subqueries like these are typically evaluated only once (as an InitPlan) and the result reused across rows.

Correlated Subqueries

-- Find employees who earn more than their department's average
SELECT 
    e.employee_id,
    e.employee_name,
    e.salary,
    d.department_name
FROM 
    employees e
JOIN 
    departments d ON e.department_id = d.department_id
WHERE 
    e.salary > (
        SELECT AVG(salary)
        FROM employees
        WHERE department_id = e.department_id
    );

Performance: Correlated subqueries execute once per outer row, potentially causing poor performance with large datasets. Consider rewriting using window functions.

Subquery in FROM

SELECT 
    category,
    AVG(price) AS avg_price
FROM 
    (
        SELECT 
            category, 
            product_name, 
            price,
            ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
        FROM 
            products
    ) AS ranked_products
WHERE 
    price_rank <= 5
GROUP BY 
    category;

Performance: Subqueries in FROM are materialized first, creating a temporary result set. Useful for complex calculations but can be memory-intensive.

Nested Subqueries

-- Find customers who have ordered products that have been ordered by at least 5 other customers
SELECT DISTINCT 
    c.customer_id, 
    c.customer_name
FROM 
    customers c
JOIN 
    orders o ON c.customer_id = o.customer_id
WHERE 
    EXISTS (
        SELECT 1
        FROM order_items oi
        WHERE oi.order_id = o.order_id
        AND oi.product_id IN (
            SELECT product_id
            FROM order_items oi2
            JOIN orders o2 ON oi2.order_id = o2.order_id
            WHERE o2.customer_id != c.customer_id
            GROUP BY product_id
            HAVING COUNT(DISTINCT o2.customer_id) >= 5
        )
    );

Performance: Deeply nested subqueries can be difficult to optimize. Often better to rewrite using CTEs for clarity and performance.

Common Table Expressions (CTEs)

-- Recursive CTE to generate date series
WITH RECURSIVE date_series AS (
    SELECT '2023-01-01'::DATE AS date
    UNION ALL
    SELECT date + 1
    FROM date_series
    WHERE date < '2023-01-31'
)
SELECT 
    d.date,
    COALESCE(COUNT(o.order_id), 0) AS order_count
FROM 
    date_series d
LEFT JOIN 
    orders o ON d.date = DATE(o.order_date)
GROUP BY 
    d.date
ORDER BY 
    d.date;

Performance: Recursive CTEs can be expensive for large recursion depths. Before PostgreSQL 12, CTEs act as optimization fences and are always materialized; PostgreSQL 12+ can inline simple, non-recursive CTEs.

Hierarchical Data with Recursive CTEs

-- Employee hierarchy with levels
WITH RECURSIVE emp_hierarchy AS (
    -- Base case: top-level employees (no manager)
    SELECT 
        employee_id, 
        employee_name, 
        manager_id, 
        1 AS level,
        employee_name::TEXT AS path
    FROM 
        employees
    WHERE 
        manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case: employees with managers
    SELECT 
        e.employee_id, 
        e.employee_name, 
        e.manager_id, 
        h.level + 1,
        h.path || ' > ' || e.employee_name
    FROM 
        employees e
    JOIN 
        emp_hierarchy h ON e.manager_id = h.employee_id
)
SELECT 
    employee_id,
    employee_name,
    level,
    path
FROM 
    emp_hierarchy
ORDER BY 
    path;

Performance: Guard the recursive term (e.g. add WHERE h.level < 20) so cyclic data cannot cause unbounded recursion. Use indexes on the join columns (manager_id, employee_id).

Multiple CTEs for Complex Analysis

WITH 
-- First CTE: Calculate revenue by product
product_revenue AS (
    SELECT 
        p.product_id,
        p.product_name,
        SUM(oi.quantity * oi.unit_price) AS revenue
    FROM 
        products p
    JOIN 
        order_items oi ON p.product_id = oi.product_id
    JOIN 
        orders o ON oi.order_id = o.order_id
    WHERE 
        o.order_date >= CURRENT_DATE - INTERVAL '1 year'
    GROUP BY 
        p.product_id, p.product_name
),

-- Second CTE: Calculate revenue percentiles
revenue_stats AS (
    SELECT 
        product_id,
        product_name,
        revenue,
        PERCENT_RANK() OVER (ORDER BY revenue) AS percentile
    FROM 
        product_revenue
),

-- Third CTE: Categorize products by revenue quartile
revenue_quartiles AS (
    SELECT 
        product_id,
        product_name,
        revenue,
        CASE 
            WHEN percentile < 0.25 THEN 'Low Revenue'
            WHEN percentile < 0.5 THEN 'Below Average'
            WHEN percentile < 0.75 THEN 'Above Average'
            ELSE 'High Revenue'
        END AS revenue_category
    FROM 
        revenue_stats
)

-- Main query: Get order counts by revenue category
SELECT 
    rq.revenue_category,
    COUNT(DISTINCT o.order_id) AS order_count,
    COUNT(DISTINCT o.customer_id) AS customer_count,
    SUM(oi.quantity) AS total_units_sold,
    SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM 
    revenue_quartiles rq
JOIN 
    order_items oi ON rq.product_id = oi.product_id
JOIN 
    orders o ON oi.order_id = o.order_id
WHERE 
    o.order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY 
    rq.revenue_category
ORDER BY 
    total_revenue DESC;

Performance: Breaking complex queries into CTEs improves readability and often optimization. PostgreSQL 12+ will inline simple CTEs for better performance.

UNION, INTERSECT, and EXCEPT Operations

UNION and UNION ALL

-- Combine customers and suppliers into a contact list
SELECT 
    customer_id AS contact_id,
    customer_name AS contact_name,
    'Customer' AS contact_type,
    city,
    country
FROM 
    customers

UNION ALL -- keeps duplicates, faster than UNION

SELECT 
    supplier_id AS contact_id,
    supplier_name AS contact_name,
    'Supplier' AS contact_type,
    city,
    country
FROM 
    suppliers
ORDER BY 
    country, city, contact_name;

Performance:

  • UNION eliminates duplicates (requires sorting) - slower but ensures uniqueness
  • UNION ALL retains all rows (no deduplication) - significantly faster; use when duplicates are impossible or acceptable

INTERSECT and EXCEPT

-- Find customers who have placed orders but never returned anything
SELECT customer_id FROM customers
INTERSECT
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM returns;

-- Find products that have been both ordered and returned in the same month
SELECT 
    p.product_id, 
    p.product_name
FROM 
    products p
WHERE 
    EXISTS (
        SELECT 1
        FROM order_items oi
        JOIN orders o ON oi.order_id = o.order_id
        WHERE 
            oi.product_id = p.product_id
            AND EXISTS (
                SELECT 1
                FROM returns r
                WHERE 
                    r.order_id = o.order_id
                    AND EXTRACT(MONTH FROM r.return_date) = EXTRACT(MONTH FROM o.order_date)
                    AND EXTRACT(YEAR FROM r.return_date) = EXTRACT(YEAR FROM o.order_date)
            )
    );

Performance:

  • INTERSECT: Finds common rows (requires sorting) - moderate performance
  • EXCEPT: Finds rows in first query not in second (requires sorting) - moderate performance
  • Alternative rewrites with EXISTS or NOT EXISTS can be faster with proper indexing

CASE Statements and Conditional Logic

Simple CASE

SELECT 
    product_name,
    price,
    CASE 
        WHEN price < 50 THEN 'Budget'
        WHEN price BETWEEN 50 AND 100 THEN 'Mid-range'
        WHEN price > 100 THEN 'Premium'
        ELSE 'Unknown'
    END AS price_category
FROM 
    products;

CASE in Group By

-- Group by calculated categories
SELECT 
    CASE 
        WHEN age < 18 THEN 'Under 18'
        WHEN age BETWEEN 18 AND 24 THEN '18-24'
        WHEN age BETWEEN 25 AND 34 THEN '25-34'
        WHEN age BETWEEN 35 AND 44 THEN '35-44'
        WHEN age BETWEEN 45 AND 54 THEN '45-54'
        ELSE '55+'
    END AS age_group,
    COUNT(*) AS customer_count,
    AVG(annual_spend) AS avg_spend
FROM 
    customers
GROUP BY 
    CASE 
        WHEN age < 18 THEN 'Under 18'
        WHEN age BETWEEN 18 AND 24 THEN '18-24'
        WHEN age BETWEEN 25 AND 34 THEN '25-34'
        WHEN age BETWEEN 35 AND 44 THEN '35-44'
        WHEN age BETWEEN 45 AND 54 THEN '45-54'
        ELSE '55+'
    END;

CASE with Aggregations

-- Pivot data using CASE
SELECT 
    product_category,
    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2021 THEN order_amount ELSE 0 END) AS sales_2021,
    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN order_amount ELSE 0 END) AS sales_2022,
    SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN order_amount ELSE 0 END) AS sales_2023
FROM 
    sales
GROUP BY 
    product_category
ORDER BY 
    product_category;

PostgreSQL Indexes and Performance

Index Types

-- B-tree index (default): equality and range queries
CREATE INDEX idx_customers_name ON customers(customer_name);

-- Hash index: equality comparisons only; can be faster than B-tree for simple equality lookups
CREATE INDEX idx_orders_status_hash ON orders USING HASH (status);

-- GiST (Generalized Search Tree): geospatial data, full-text search
CREATE INDEX idx_locations_position ON locations USING GIST (coordinates);

-- GIN (Generalized Inverted Index): for columns containing arrays/jsonb
CREATE INDEX idx_products_tags ON products USING GIN (tags);

-- BRIN (Block Range Index): for large tables with ordered data
CREATE INDEX idx_logs_time_brin ON logs USING BRIN (timestamp);

-- Multicolumn index: for queries with multiple conditions
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

-- Expression index: for queries with expressions
CREATE INDEX idx_customer_lower_email ON customers(LOWER(email));

-- Partial index: index only a subset of rows
CREATE INDEX idx_orders_recent ON orders(order_date) 
WHERE order_date > CURRENT_DATE - INTERVAL '3 months';

Index Performance Considerations

1. Column Selectivity
   - High selectivity (many unique values): Good for B-tree indexes
   - Low selectivity (few unique values): Consider partial indexes

2. Query Patterns
   - Equality (col = value): B-tree or Hash
   - Range (col > value): B-tree only
   - Pattern matching: GIN with pg_trgm for LIKE/ILIKE
   - Full-text search: GIN with tsvector

3. Write Impact
   - Indexes speed up SELECT but slow down INSERT/UPDATE/DELETE
   - Consider index maintenance overhead for write-heavy tables

4. Size Considerations
   - GIN indexes are larger but faster for complex searches
   - BRIN indexes are very small but less precise

5. Common Mistakes
   - Over-indexing: Adds overhead without benefit
   - Wrong index type: e.g., B-tree for full-text search
   - Wrong column order in multicolumn indexes

Covering Indexes (INCLUDE)

-- Create an index that "covers" a query (includes all needed columns)
CREATE INDEX idx_orders_customer_include_date ON orders(customer_id) INCLUDE (order_date, status);

-- Query that can be resolved entirely from the index (index-only scan)
SELECT order_date, status FROM orders WHERE customer_id = 123;

Performance Benefit: Eliminates table lookups, allowing index-only scans (INCLUDE requires PostgreSQL 11+)

Query Performance Analysis

EXPLAIN Command

-- Basic execution plan (estimation only)
EXPLAIN 
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

-- Detailed execution plan with actual timing
EXPLAIN ANALYZE
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

-- Get plan in JSON format for programmatic analysis
EXPLAIN (FORMAT JSON, ANALYZE, BUFFERS)
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

Interpreting Query Plans

Example EXPLAIN output with annotations:

EXPLAIN ANALYZE
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

# Output:
HashAggregate  (cost=219.27..221.27 rows=200 width=40) (actual time=0.968..1.001 rows=200 loops=1)
  # Groups by customer_name and computes count
  Group Key: c.customer_name
  ->  Hash Join  (cost=11.75..207.27 rows=1200 width=40) (actual time=0.208..0.771 rows=1200 loops=1)
        # Joins customer and order tables using hash
        Hash Cond: (o.customer_id = c.customer_id)
        ->  Seq Scan on orders o  (cost=0.00..127.00 rows=1200 width=8) (actual time=0.011..0.316 rows=1200 loops=1)
            # Full table scan on orders
        ->  Hash  (cost=8.00..8.00 rows=300 width=40) (actual time=0.153..0.153 rows=300 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 22kB
              ->  Seq Scan on customers c  (cost=0.00..8.00 rows=300 width=40) (actual time=0.009..0.077 rows=300 loops=1)
                  # Full table scan on customers
Planning Time: 0.273 ms
Execution Time: 1.050 ms

Key Scan Types (Best to Worst)

1. Index Only Scan - Retrieves data directly from index without table access
   ✓ Fastest possible scan when applicable
   ✓ Requires VACUUM to keep visibility map updated
   
2. Index Scan - Uses index to find rows, then fetches from table
   ✓ Good for retrieving small portion of table
   ✓ Efficient for high-selectivity conditions

3. Bitmap Index Scan - Builds bitmap of qualifying rows first
   ✓ Efficient for retrieving medium-sized result sets
   ✓ Can combine results from multiple indexes

4. Seq Scan - Reads entire table
   ✓ Best for retrieving large portions of a table
   ✓ When no useful index exists or optimizer decides full scan is faster
   
5. CTE Scan - Materializes a CTE before using it
   ✓ Can be beneficial for complex CTEs used multiple times
   ✓ But prevents some query optimizations

Common Performance Issues and Solutions

-- Issue: Missing join index
-- Solution: Create index on join column
CREATE INDEX idx_orders_customer_id ON orders(customer_id);

-- Issue: Inefficient LIKE pattern
-- Solution: Use pg_trgm extension and GIN index
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_products_name_trgm ON products USING GIN (product_name gin_trgm_ops);
-- Now efficient: SELECT * FROM products WHERE product_name LIKE '%keyword%';

-- Issue: Full table scans on large tables
-- Solution: Add appropriate indexes and rewrite query
-- Before: WHERE EXTRACT(YEAR FROM date_column) = 2023   (wraps the column in a function, so a plain index can't be used)
-- After:  WHERE date_column >= '2023-01-01' AND date_column < '2024-01-01'

-- Issue: Slow aggregations
-- Solution: Use materialized views for frequently needed aggregates
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS monthly_revenue
FROM 
    orders
GROUP BY 
    DATE_TRUNC('month', order_date);
CREATE UNIQUE INDEX idx_monthly_sales_month ON monthly_sales(month);
-- Refresh as needed: REFRESH MATERIALIZED VIEW monthly_sales;

SQL Assessment Time Management Tips

Quick Planning Steps

  1. Read the entire question before starting
  2. Identify tables needed and their relationships
  3. List required columns in the output
  4. Determine aggregations/groupings needed
  5. Sketch the query structure on paper/comment

Solving Approaches

-- Start with a simple structure that returns SOMETHING
SELECT customer_id, order_date
FROM orders
WHERE order_date > '2023-01-01';

-- Add complexity incrementally
SELECT 
    c.customer_name,
    COUNT(o.order_id) AS order_count,
    SUM(o.total_amount) AS total_spent
FROM 
    customers c
JOIN 
    orders o ON c.customer_id = o.customer_id
WHERE 
    o.order_date > '2023-01-01'
GROUP BY 
    c.customer_name;

Verification Tips

  • Run queries against small subsets first
  • Use LIMIT to check initial results
  • Test edge cases: NULL values, empty result sets, duplicates
  • Review joins to ensure proper cardinality

General Tips

Trade-off Discussion Framework

1. Clearly state the options
2. Technical pros and cons:
   - Performance considerations
   - Scalability implications
   - Maintainability aspects
   - Security considerations
3. Business pros and cons:
   - Cost implications
   - Time to market
   - User experience impact
   - Future flexibility
4. Recommendation with clear rationale

Complex Technical Challenge Template

1. Problem Statement:
   "We needed to reduce API latency by 50% while handling 10x more traffic"

2. Constraints:
   - No additional hardware budget
   - Must maintain backward compatibility
   - 99.99% uptime requirement

3. Approach:
   - Profiled application to identify bottlenecks
   - Implemented multi-level caching strategy
   - Optimized database queries and added indexes
   - Introduced connection pooling

4. Results:
   - Achieved 70% latency reduction
   - Successfully handled 20x traffic spike during holiday season
   - Reduced infrastructure costs by 15%

Explaining Technical Concepts at Different Levels

1. Executive Level (Why it matters):
   "This architecture reduces our time-to-market for new features by 40%"

2. Product Management Level (What it enables):
   "This approach allows us to deploy new features independently without system-wide testing"

3. Technical Peer Level (How it works):
   "We're using a message queue with dead-letter handling to ensure reliable event processing"

4. Junior Engineer Level (Implementation details):
   "Let me walk through the event flow and error handling patterns we're using..."

Business Impact Demonstration

Always connect technical decisions to business outcomes:

- Revenue: "This optimization increased conversion rate by 0.5%, adding $1M in annual revenue"
- Cost: "Migrating to serverless reduced infrastructure costs by $250K annually"
- Time: "The new CI/CD pipeline reduced release cycle from 2 weeks to 2 days"
- Quality: "Automated testing reduced production bugs by 60%"
- Risk: "Multi-region deployment reduced our disaster recovery time from 4 hours to 5 minutes"