- Verbalize your thought process: "I'm thinking we should check for edge cases here..."
- Explain trade-offs: "We could use a hash map for O(1) lookup, but it would cost O(n) space..."
- Articulate assumptions: "I'm assuming the input array is not sorted..."
- Narrate your debugging process: "Let's trace through with this example input..."
# Before: O(n²) time complexity
def find_duplicate(arr):
    for i in range(len(arr)):
        for j in range(i + 1, len(arr)):
            if arr[i] == arr[j]:
                return arr[i]
    return -1

# After: O(n) time complexity, O(n) space complexity
def find_duplicate_optimized(arr):
    seen = set()
    for num in arr:
        if num in seen:
            return num
        seen.add(num)
    return -1
# Further optimization: Floyd's Tortoise and Hare, O(1) space
# (applies when arr holds n+1 values in the range 1..n, so every value is a valid index)
def find_duplicate_cycle(arr):
    slow = fast = arr[0]
    while True:
        slow = arr[slow]
        fast = arr[arr[fast]]
        if slow == fast:
            break
    slow = arr[0]
    while slow != fast:
        slow = arr[slow]
        fast = arr[fast]
    return slow
# Before
def p(d, s):
    r = []
    for i in range(len(d)):
        if d[i] > s:
            r.append(i)
    return r

# After
def find_elements_greater_than_threshold(data, threshold):
    """Returns indices of elements that exceed the given threshold."""
    result_indices = []
    for index, value in enumerate(data):
        if value > threshold:
            result_indices.append(index)
    return result_indices
# Poor error handling
def divide(a, b):
    return a / b  # Will crash if b == 0

# Better error handling
def divide_safely(a, b):
    if b == 0:
        return None  # Or raise a custom exception
    return a / b

# Comprehensive error handling
import logging

def divide_with_validation(a, b):
    try:
        if not isinstance(a, (int, float)) or not isinstance(b, (int, float)):
            raise TypeError("Both inputs must be numbers")
        if b == 0:
            raise ValueError("Cannot divide by zero")
        return a / b
    except Exception as e:
        logging.error(f"Error in division: {e}")
        # Handle appropriately or re-raise
        raise
# Unit testing example
def test_find_duplicate():
    # Test normal case
    assert find_duplicate([1, 2, 3, 2, 4]) == 2
    # Test edge cases
    assert find_duplicate([]) == -1
    assert find_duplicate([1]) == -1
    # Test with no duplicates
    assert find_duplicate([1, 2, 3, 4]) == -1
    # Performance test with large input
    large_input = list(range(10000)) + [5]
    assert find_duplicate(large_input) == 5
- Examples to mention:
- "I refactored a monolithic function into smaller, testable units reducing bug rate by 40%"
- "I introduced caching to a frequently called API endpoint, reducing response time by 300ms"
- "I standardized error handling across the service, improving debuggability"
- "I added comprehensive logging, enabling us to identify a subtle race condition"
- Vertical Scaling: Upgrading existing machines (more CPU, RAM)
- ✅ Simple to implement
- ❌ Hardware limits, single point of failure
- Horizontal Scaling: Adding more machines
- ✅ Theoretically unlimited scaling, fault tolerance
- ❌ More complex architecture, data consistency challenges
Client → [ Load Balancer ] → Server 1
→ Server 2
→ Server 3
- Strategies: Round robin, least connections, IP hash (see the sketch after this list)
- Sticky Sessions: Route user to same server for session duration
- Health Checks: Automatically remove unhealthy servers
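A minimal sketch of the round robin and least connections strategies listed above (the server names are placeholders; real load balancers such as NGINX or HAProxy implement these policies for you):

import itertools

class RoundRobinBalancer:
    """Cycle through servers in a fixed order."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    """Pick the server currently handling the fewest connections."""
    def __init__(self, servers):
        # In a real proxy, connection counts are updated as requests open and close
        self.connections = {server: 0 for server in servers}

    def next_server(self):
        server = min(self.connections, key=self.connections.get)
        self.connections[server] += 1
        return server

balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
print([balancer.next_server() for _ in range(5)])  # server-1, server-2, server-3, server-1, server-2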
- Consistency: All nodes see the same data at the same time
- Availability: Every request receives a response
- Partition Tolerance: System continues to operate despite network failures
- You can only guarantee two of the three; in practice, when a network partition occurs you must choose between consistency and availability
CA: Traditional RDBMS (PostgreSQL, MySQL)
CP: MongoDB, HBase, Redis
AP: Cassandra, DynamoDB, CouchDB
- Strong Consistency: All reads reflect all prior writes
- Eventual Consistency: Given enough time, all replicas will converge
- Causal Consistency: Operations causally related must be seen in same order
- Session Consistency: Client sees its own updates
- Circuit Breaker Pattern:
// Pseudo-code
CircuitBreaker breaker = new CircuitBreaker(threshold=5, timeout=60s);

public Response callService() {
    if (breaker.isOpen()) {
        return fallbackResponse();
    }
    try {
        Response response = service.call();
        breaker.recordSuccess();
        return response;
    } catch (Exception e) {
        breaker.recordFailure();
        return fallbackResponse();
    }
}
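A minimal Python sketch of the same pattern (the threshold and timeout defaults are illustrative; production code would also handle half-open trial calls and concurrency):

import time

class CircuitBreaker:
    def __init__(self, threshold=5, timeout=60):
        self.threshold = threshold  # consecutive failures before opening
        self.timeout = timeout      # seconds to stay open before allowing a retry
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at >= self.timeout:
            self.opened_at = None   # timeout elapsed: allow a trial call (half-open)
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.time()

def call_service(breaker, service_call, fallback):
    if breaker.is_open():
        return fallback()
    try:
        response = service_call()
        breaker.record_success()
        return response
    except Exception:
        breaker.record_failure()
        return fallback()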
┌───────────────────────────────┐
│ Monolith │
│ ┌─────────┐ ┌─────────────┐ │
│ │ UI │ │ Business │ │
│ │ │ │ Logic │ │
│ └─────────┘ └─────────────┘ │
│ ┌──────────────────────────┐ │
│ │ Database │ │
│ └──────────────────────────┘ │
└───────────────────────────────┘
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Service A │ │ Service B │ │ Service C │
│ ┌─────────┐│ │┌─────────┐│ │┌─────────┐│
│ │ Business││ ││ Business││ ││ Business││
│ │ Logic ││ ││ Logic ││ ││ Logic ││
│ └─────────┘│ │└─────────┘│ │└─────────┘│
│ ┌─────────┐│ │┌─────────┐│ │┌─────────┐│
│ │ DB ││ ││ DB ││ ││ DB ││
│ └─────────┘│ │└─────────┘│ │└─────────┘│
└───────────┘ └───────────┘ └───────────┘
- Monoliths
- ✅ Simpler development, deployment, debugging
- ✅ Better performance (no network calls)
- ❌ Scaling challenges, technology lock-in
- Microservices
- ✅ Independent scaling and deployment
- ✅ Technology diversity, team autonomy
- ❌ Distributed system complexity
- ❌ Service boundary design challenges
| SQL (Relational) | NoSQL |
|---|---|
| Structured data | Varied data structures |
| ACID transactions | BASE (Basically Available, Soft state, Eventually consistent) |
| Rigid schema | Flexible schema |
| Vertical scaling | Horizontal scaling |
| Examples: PostgreSQL, MySQL | Examples: MongoDB, Cassandra, Redis |
# Cache-Aside Pattern
def get_user(user_id):
    # Try to get from cache
    user = cache.get(f"user:{user_id}")
    if user is not None:
        return user
    # Cache miss: get from database (use a parameterized query to avoid SQL injection)
    user = db.query("SELECT * FROM users WHERE id = %s", (user_id,))
    # Update cache with a TTL
    cache.set(f"user:{user_id}", user, expiry=3600)
    return user
- Cache Invalidation Strategies (a write-through / write-around sketch follows this list):
- TTL (Time-To-Live)
- Write-through (update cache when writing to DB)
- Write-behind (async update to DB after cache write)
- Write-around (write to DB, let cache expire naturally)
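A minimal sketch contrasting write-through and write-around with the cache-aside code above (cache and db are the same hypothetical helpers used there):

def update_user_write_through(user_id, user_data):
    # Write-through: update the database and the cache together
    db.update("users", user_id, user_data)  # hypothetical DB helper
    cache.set(f"user:{user_id}", user_data, expiry=3600)
    return user_data

def update_user_write_around(user_id, user_data):
    # Write-around: write only to the database and drop any stale cache entry
    db.update("users", user_id, user_data)
    cache.delete(f"user:{user_id}")  # the next read repopulates the cache via cache-aside
    return user_data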
Producer → [Queue] → Consumer
- Benefits (see the sketch after this list):
- Decoupling producers from consumers
- Buffer for traffic spikes
- Ensuring message delivery
- Enabling parallel processing
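A minimal in-process illustration of that decoupling using Python's standard queue module (real systems would use one of the brokers listed further below, e.g. Kafka or RabbitMQ):

import queue
import threading

message_queue = queue.Queue(maxsize=100)  # a bounded queue buffers traffic spikes

def producer():
    for i in range(5):
        message_queue.put({"order_id": i})  # the producer only knows about the queue
    message_queue.put(None)                 # sentinel to signal shutdown

def consumer():
    while True:
        message = message_queue.get()
        if message is None:
            break
        print(f"processing {message}")      # the consumer never talks to the producer directly
        message_queue.task_done()

threading.Thread(target=producer).start()
consumer_thread = threading.Thread(target=consumer)
consumer_thread.start()
consumer_thread.join()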
┌─────────────┐
│Event Source │
└──────┬──────┘
│
┌──────▼──────┐
│Event Channel│
└──────┬──────┘
│
┌─────────────┼────────────┐
│ │ │
┌───────▼─────┐ ┌─────▼───────┐ ┌─▼────────────┐
│Consumer A │ │Consumer B │ │Consumer C │
└─────────────┘ └─────────────┘ └──────────────┘
- Examples: Kafka, RabbitMQ, AWS SQS/SNS, Google Pub/Sub
- Situation: "We needed to migrate our monolith to microservices"
- Task: "I was tasked with leading the authentication service migration"
- Action: "I created a phased migration plan, introduced feature flags for safe rollbacks, and implemented parallel running with traffic mirroring"
- Result: "We completed the migration with zero downtime and improved auth response times by 40%"
- Technical Debt Reduction: "Reduced CI pipeline times from 45 to 12 minutes"
- Architecture Evolution: "Led transition from batch processing to streaming architecture"
- Performance Optimization: "Spearheaded database optimization reducing query times by 85%"
- Pair Programming: Regular sessions focused on specific skills
- Code Reviews: Thoughtful feedback with explanations and resources
- Architecture Decision Records: Documenting decisions and reasoning
- Tech Talks: Organizing knowledge sharing sessions
- "Mentored junior engineer who became team lead in 18 months"
- "Established onboarding process reducing time-to-productivity by 40%"
- "Created internal documentation that became company standard"
1. Clearly state the problem
2. List constraints (time, budget, people, tech)
3. Identify 2-3 viable options
4. Analyze trade-offs (performance, scalability, maintainability)
5. Make recommendation with rationale
6. Define success metrics
- Decision: "Move from self-hosted to managed Kubernetes"
- Business Impact: "Reduced ops overhead by 30%, enabling team to focus on product features"
- Technical Impact: "Improved reliability from 99.9% to 99.99% uptime"
1. Focus on shared goals
2. Base discussion on data when possible
3. Separate facts from opinions
4. Create pros/cons list for each approach
5. Consider short-term vs long-term implications
6. Agree on decision-making framework
7. Commit to decision once made
- "Team disagreed on GraphQL vs REST API approach"
- "I organized a spike comparing performance, developer experience, and client needs"
- "Data showed GraphQL improved mobile app performance significantly"
- "We agreed on GraphQL with performance monitoring in place"
- Interface Contracts: Clearly defined API contracts
- RFC Process: Request For Comments documentation
- Joint Planning: Cross-team planning sessions
- Embedded Team Members: Temporary cross-team assignments
- "Partnered with data science team to implement ML feature"
- "Created shared roadmap with clear responsibilities"
- "Established weekly sync meetings focusing on integration points"
- "Result: Successfully launched feature ahead of schedule"
- "What's the most interesting technical challenge the team has solved recently?"
- "How do you balance technical debt against new feature development?"
- "What's your approach to keeping technology stack current?"
- "How are architecture decisions made and documented?"
- "How do you handle technical disagreements within the team?"
- "Can you walk me through a recent significant technical decision?"
- "How do you approach mentorship and career growth?"
- "What skills are you looking to develop in the team?"
- "How do you measure engineering effectiveness?"
-- PostgreSQL-specific data types
CREATE TABLE example (
id SERIAL PRIMARY KEY,
text_data TEXT,
json_data JSONB,
array_data INTEGER[],
range_data INT4RANGE,
enum_field my_enum_type,
ip_address INET
);
-- JSON creation and manipulation
SELECT
-- Create JSON object
jsonb_build_object(
'name', customer_name,
'email', email,
'orders', order_count
) AS customer_json,
-- Extract value from JSON (multiple ways)
customer_data->'address'->'city' AS city,
customer_data->>'country' AS country, -- Extract as text
jsonb_path_query(customer_data, '$.preferences[*] ? (@.type == "primary")') AS primary_prefs,
-- Check for key existence
customer_data ? 'premium' AS is_premium,
-- Update JSON values
jsonb_set(
customer_data,
'{last_login}',
to_jsonb(CURRENT_TIMESTAMP)
) AS updated_data
FROM
customers;
Performance:
- Use JSONB (not JSON) for better performance and indexing
- GIN indexes can dramatically speed up JSON queries:
CREATE INDEX idx_customer_data ON customers USING GIN (customer_data);
- Use the ->>'key' operator (returns text) when comparing values, but -> (returns JSON) when traversing
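A short illustration of those operator choices against the same customers table; the containment operator @> is one the GIN index above can serve directly:

-- ->> returns text, suitable for value comparisons
SELECT customer_id
FROM customers
WHERE customer_data->>'country' = 'US';

-- @> (containment) can use the GIN index on customer_data
SELECT customer_id
FROM customers
WHERE customer_data @> '{"premium": true}';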
-- Array operations
SELECT
-- Array construction
ARRAY[1, 2, 3] AS numeric_array,
ARRAY['red', 'green', 'blue'] AS color_array,
string_to_array('apple,orange,banana', ',') AS fruit_array,
-- Array access (1-based indexing)
tags[1] AS first_tag,
-- Array functions
array_length(tags, 1) AS num_tags,
array_to_string(tags, ', ') AS tag_list,
array_append(tags, 'new_tag') AS tags_with_new,
array_remove(tags, 'old_tag') AS tags_without_old,
-- Array containment
ARRAY[1, 2] <@ available_sizes AS size_available,
-- Unnest arrays into rows
unnest(category_ids) AS category_id
FROM
products;
Performance:
- Array operations in WHERE clause can use GIN indexes:
CREATE INDEX idx_product_tags ON products USING GIN (tags);
- Unnesting large arrays creates many rows and can be slow
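For example, a containment filter like the following (assuming tags is a text array, as in the query above) can be served by that GIN index:

SELECT product_id, product_name
FROM products
WHERE tags @> ARRAY['sale'];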
-- Enable common extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS "postgis";
CREATE EXTENSION IF NOT EXISTS "pg_trgm";
-- Use extension function
SELECT uuid_generate_v4();
-- Full-text search (built-in tsvector/tsquery; no extension required)
SELECT
product_id,
product_name,
description,
ts_rank(to_tsvector('english', description), query) AS rank
FROM
products,
to_tsquery('english', 'comfortable & leather') AS query
WHERE
to_tsvector('english', description) @@ query
ORDER BY
rank DESC;
Performance:
- Full-text search benefits greatly from specialized indexes:
CREATE INDEX idx_product_description_fts ON products
USING GIN (to_tsvector('english', description));
-- Return only matching rows
SELECT
o.order_id,
c.customer_name,
o.order_date
FROM
orders o
INNER JOIN
customers c ON o.customer_id = c.customer_id;
-- Return all rows from left table, matching from right
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) AS order_count
FROM
customers c
LEFT JOIN
orders o ON c.customer_id = o.customer_id
GROUP BY
c.customer_id,
c.customer_name;
-- Return all rows from right table, matching from left
SELECT
p.product_id,
p.product_name,
o.order_id
FROM
order_items oi
RIGHT JOIN
products p ON oi.product_id = p.product_id
LEFT JOIN
orders o ON oi.order_id = o.order_id;
-- Return all rows from both tables
SELECT
e.employee_id,
e.employee_name,
d.department_name
FROM
employees e
FULL JOIN
departments d ON e.department_id = d.department_id;
-- Join a table to itself
SELECT
e.employee_name AS employee,
m.employee_name AS manager
FROM
employees e
LEFT JOIN
employees m ON e.manager_id = m.employee_id;
SELECT
product_name,
category,
price,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS row_num,
RANK() OVER (PARTITION BY category ORDER BY price DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY category ORDER BY price DESC) AS dense_rank
FROM
products;
-- Compare current row with previous/next rows
SELECT
order_date,
total_amount,
LAG(total_amount) OVER (ORDER BY order_date) AS prev_amount,
LEAD(total_amount) OVER (ORDER BY order_date) AS next_amount,
total_amount - LAG(total_amount) OVER (ORDER BY order_date) AS amount_change
FROM
orders;
-- Calculate cumulative sum
SELECT
order_date,
total_amount,
SUM(total_amount) OVER (ORDER BY order_date) AS running_total,
SUM(total_amount) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS explicit_running_total
FROM
orders;
-- Filter grouped results
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(total_amount) AS total_spent,
AVG(total_amount) AS avg_order_value
FROM
orders
GROUP BY
customer_id
HAVING
COUNT(*) > 5 AND SUM(total_amount) > 1000;
-- Group by multiple columns
SELECT
EXTRACT(YEAR FROM order_date) AS year,
EXTRACT(MONTH FROM order_date) AS month,
product_category,
COUNT(*) AS order_count,
SUM(quantity) AS units_sold
FROM
orders o
JOIN
order_items oi ON o.order_id = oi.order_id
JOIN
products p ON oi.product_id = p.product_id
GROUP BY
EXTRACT(YEAR FROM order_date),
EXTRACT(MONTH FROM order_date),
product_category
ORDER BY
year, month, product_category;
-- Produce subtotals and grand totals
SELECT
COALESCE(category, 'All Categories') AS category,
COALESCE(sub_category, 'All Sub-Categories') AS sub_category,
SUM(sales) AS total_sales
FROM
sales
GROUP BY
ROLLUP(category, sub_category);
SELECT
department_name,
employee_count,
(SELECT AVG(employee_count) FROM department_stats) AS avg_dept_size,
employee_count - (SELECT AVG(employee_count) FROM department_stats) AS diff_from_avg
FROM
department_stats;
Performance: A correlated scalar subquery in the SELECT list runs once per output row; PostgreSQL evaluates an uncorrelated one like this only once, but computing the value in a derived table or CTE and joining it in keeps that cost explicit.
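One way to compute the average once and reuse it, sketched against the same department_stats table:

SELECT
    ds.department_name,
    ds.employee_count,
    stats.avg_dept_size,
    ds.employee_count - stats.avg_dept_size AS diff_from_avg
FROM
    department_stats ds
CROSS JOIN
    (SELECT AVG(employee_count) AS avg_dept_size FROM department_stats) AS stats;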
-- Find employees who earn more than their department's average
SELECT
e.employee_id,
e.employee_name,
e.salary,
d.department_name
FROM
employees e
JOIN
departments d ON e.department_id = d.department_id
WHERE
e.salary > (
SELECT AVG(salary)
FROM employees
WHERE department_id = e.department_id
);
Performance: Correlated subqueries execute once per outer row, potentially causing poor performance with large datasets. Consider rewriting using window functions.
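A window-function rewrite of the query above, sketched against the same employees and departments tables, which computes each department's average in a single pass:

SELECT
    employee_id,
    employee_name,
    salary,
    department_name
FROM (
    SELECT
        e.employee_id,
        e.employee_name,
        e.salary,
        d.department_name,
        AVG(e.salary) OVER (PARTITION BY e.department_id) AS dept_avg_salary
    FROM
        employees e
    JOIN
        departments d ON e.department_id = d.department_id
) AS salaries_with_avg
WHERE
    salary > dept_avg_salary;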
SELECT
category,
AVG(price) AS avg_price
FROM
(
SELECT
category,
product_name,
price,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS price_rank
FROM
products
) AS ranked_products
WHERE
price_rank <= 5
GROUP BY
category;
Performance: Subqueries in FROM are materialized first, creating a temporary result set. Useful for complex calculations but can be memory-intensive.
-- Find customers who have ordered products that have been ordered by at least 5 other customers
SELECT DISTINCT
c.customer_id,
c.customer_name
FROM
customers c
JOIN
orders o ON c.customer_id = o.customer_id
WHERE
EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.order_id = o.order_id
AND oi.product_id IN (
SELECT product_id
FROM order_items oi2
JOIN orders o2 ON oi2.order_id = o2.order_id
WHERE o2.customer_id != c.customer_id
GROUP BY product_id
HAVING COUNT(DISTINCT o2.customer_id) >= 5
)
);
Performance: Deeply nested subqueries can be difficult to optimize. Often better to rewrite using CTEs for clarity and performance.
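A CTE-based sketch of the same idea (same tables); note it simplifies the condition to products ordered by at least 5 distinct customers, dropping the "other customers" exclusion for clarity:

WITH popular_products AS (
    SELECT oi.product_id
    FROM order_items oi
    JOIN orders o ON oi.order_id = o.order_id
    GROUP BY oi.product_id
    HAVING COUNT(DISTINCT o.customer_id) >= 5
)
SELECT DISTINCT
    c.customer_id,
    c.customer_name
FROM
    customers c
JOIN
    orders o ON c.customer_id = o.customer_id
JOIN
    order_items oi ON o.order_id = oi.order_id
JOIN
    popular_products pp ON oi.product_id = pp.product_id;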
-- Recursive CTE to generate date series
WITH RECURSIVE date_series AS (
SELECT '2023-01-01'::DATE AS date
UNION ALL
SELECT date + 1
FROM date_series
WHERE date < '2023-01-31'
)
SELECT
d.date,
COALESCE(COUNT(o.order_id), 0) AS order_count
FROM
date_series d
LEFT JOIN
orders o ON d.date = DATE(o.order_date)
GROUP BY
d.date
ORDER BY
d.date;
Performance: Recursive CTEs can be expensive for large recursion depths. PostgreSQL may not optimize CTEs as aggressively as subqueries in some versions.
-- Employee hierarchy with levels
WITH RECURSIVE emp_hierarchy AS (
-- Base case: top-level employees (no manager)
SELECT
employee_id,
employee_name,
manager_id,
1 AS level,
employee_name::TEXT AS path
FROM
employees
WHERE
manager_id IS NULL
UNION ALL
-- Recursive case: employees with managers
SELECT
e.employee_id,
e.employee_name,
e.manager_id,
h.level + 1,
h.path || ' > ' || e.employee_name
FROM
employees e
JOIN
emp_hierarchy h ON e.manager_id = h.employee_id
)
SELECT
employee_id,
employee_name,
level,
path
FROM
emp_hierarchy
ORDER BY
path;
Performance: Set a reasonable recursion limit to prevent infinite loops. Use indexes on join columns.
WITH
-- First CTE: Calculate revenue by product
product_revenue AS (
SELECT
p.product_id,
p.product_name,
SUM(oi.quantity * oi.unit_price) AS revenue
FROM
products p
JOIN
order_items oi ON p.product_id = oi.product_id
JOIN
orders o ON oi.order_id = o.order_id
WHERE
o.order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY
p.product_id, p.product_name
),
-- Second CTE: Calculate revenue percentiles
revenue_stats AS (
SELECT
product_id,
product_name,
revenue,
PERCENT_RANK() OVER (ORDER BY revenue) AS percentile
FROM
product_revenue
),
-- Third CTE: Categorize products by revenue quartile
revenue_quartiles AS (
SELECT
product_id,
product_name,
revenue,
CASE
WHEN percentile < 0.25 THEN 'Low Revenue'
WHEN percentile < 0.5 THEN 'Below Average'
WHEN percentile < 0.75 THEN 'Above Average'
ELSE 'High Revenue'
END AS revenue_category
FROM
revenue_stats
)
-- Main query: Get order counts by revenue category
SELECT
rq.revenue_category,
COUNT(DISTINCT o.order_id) AS order_count,
COUNT(DISTINCT o.customer_id) AS customer_count,
SUM(oi.quantity) AS total_units_sold,
SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM
revenue_quartiles rq
JOIN
order_items oi ON rq.product_id = oi.product_id
JOIN
orders o ON oi.order_id = o.order_id
WHERE
o.order_date >= CURRENT_DATE - INTERVAL '1 year'
GROUP BY
rq.revenue_category
ORDER BY
total_revenue DESC;
Performance: Breaking complex queries into CTEs improves readability and often optimization. PostgreSQL 12+ will inline simple CTEs for better performance.
-- Combine customers and suppliers into a contact list
SELECT
customer_id AS contact_id,
customer_name AS contact_name,
'Customer' AS contact_type,
city,
country
FROM
customers
UNION ALL -- keeps duplicates, faster than UNION
SELECT
supplier_id AS contact_id,
supplier_name AS contact_name,
'Supplier' AS contact_type,
city,
country
FROM
suppliers
ORDER BY
country, city, contact_name;
Performance:
- UNION eliminates duplicates (requires sorting) - slower but ensures uniqueness
- UNION ALL retains all rows (no deduplication) - significantly faster; use it when duplicates are impossible or acceptable
-- Find customers who have placed orders but never returned anything
SELECT customer_id FROM customers
INTERSECT
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM returns;
-- Find products that have been both ordered and returned in the same month
SELECT
p.product_id,
p.product_name
FROM
products p
WHERE
EXISTS (
SELECT 1
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
WHERE
oi.product_id = p.product_id
AND EXISTS (
SELECT 1
FROM returns r
WHERE
r.order_id = o.order_id
AND EXTRACT(MONTH FROM r.return_date) = EXTRACT(MONTH FROM o.order_date)
AND EXTRACT(YEAR FROM r.return_date) = EXTRACT(YEAR FROM o.order_date)
)
);
Performance:
- INTERSECT: finds common rows (requires sorting) - moderate performance
- EXCEPT: finds rows in the first query that are not in the second (requires sorting) - moderate performance
- Alternative rewrites with EXISTS or NOT EXISTS can be faster with proper indexing
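For example, the customers-with-orders-but-no-returns query above can be rewritten with EXISTS / NOT EXISTS, which indexes on orders.customer_id and returns.customer_id can serve directly:

SELECT c.customer_id
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
)
AND NOT EXISTS (
    SELECT 1 FROM returns r WHERE r.customer_id = c.customer_id
);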
SELECT
product_name,
price,
CASE
WHEN price < 50 THEN 'Budget'
WHEN price BETWEEN 50 AND 100 THEN 'Mid-range'
WHEN price > 100 THEN 'Premium'
ELSE 'Unknown'
END AS price_category
FROM
products;
-- Group by calculated categories
SELECT
CASE
WHEN age < 18 THEN 'Under 18'
WHEN age BETWEEN 18 AND 24 THEN '18-24'
WHEN age BETWEEN 25 AND 34 THEN '25-34'
WHEN age BETWEEN 35 AND 44 THEN '35-44'
WHEN age BETWEEN 45 AND 54 THEN '45-54'
ELSE '55+'
END AS age_group,
COUNT(*) AS customer_count,
AVG(annual_spend) AS avg_spend
FROM
customers
GROUP BY
CASE
WHEN age < 18 THEN 'Under 18'
WHEN age BETWEEN 18 AND 24 THEN '18-24'
WHEN age BETWEEN 25 AND 34 THEN '25-34'
WHEN age BETWEEN 35 AND 44 THEN '35-44'
WHEN age BETWEEN 45 AND 54 THEN '45-54'
ELSE '55+'
END;
-- Pivot data using CASE
SELECT
product_category,
SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2021 THEN order_amount ELSE 0 END) AS sales_2021,
SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2022 THEN order_amount ELSE 0 END) AS sales_2022,
SUM(CASE WHEN EXTRACT(YEAR FROM order_date) = 2023 THEN order_amount ELSE 0 END) AS sales_2023
FROM
sales
GROUP BY
product_category
ORDER BY
product_category;
-- B-tree index (default): equality and range queries
CREATE INDEX idx_customers_name ON customers(customer_name);
-- Hash index: equality comparisons only, faster than B-tree for equality
CREATE INDEX idx_orders_status_hash ON orders USING HASH (status);
-- GiST (Generalized Search Tree): geospatial data, full-text search
CREATE INDEX idx_locations_position ON locations USING GIST (coordinates);
-- GIN (Generalized Inverted Index): for columns containing arrays/jsonb
CREATE INDEX idx_products_tags ON products USING GIN (tags);
-- BRIN (Block Range Index): for large tables with ordered data
CREATE INDEX idx_logs_time_brin ON logs USING BRIN (timestamp);
-- Multicolumn index: for queries with multiple conditions
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);
-- Expression index: for queries with expressions
CREATE INDEX idx_customer_lower_email ON customers(LOWER(email));
-- Partial index: index only a subset of rows
CREATE INDEX idx_orders_recent ON orders(order_date)
WHERE order_date > CURRENT_DATE - INTERVAL '3 months';
1. Column Selectivity
- High selectivity (many unique values): Good for B-tree indexes
- Low selectivity (few unique values): Consider partial indexes
2. Query Patterns
- Equality (col = value): B-tree or Hash
- Range (col > value): B-tree only
- Pattern matching: GIN with pg_trgm for LIKE/ILIKE
- Full-text search: GIN with tsvector
3. Write Impact
- Indexes speed up SELECT but slow down INSERT/UPDATE/DELETE
- Consider index maintenance overhead for write-heavy tables
4. Size Considerations
- GIN indexes are larger but faster for complex searches
- BRIN indexes are very small but less precise
5. Common Mistakes
- Over-indexing: Adds overhead without benefit
- Wrong index type: e.g., B-tree for full-text search
- Wrong column order in multicolumn indexes
-- Create an index that "covers" a query (includes all needed columns)
CREATE INDEX idx_orders_customer_include_date ON orders(customer_id) INCLUDE (order_date, status);
-- Query that can be resolved entirely from the index (index-only scan)
SELECT order_date, status FROM orders WHERE customer_id = 123;
Performance Benefit: Eliminates table lookups, allowing index-only scans
-- Basic execution plan (estimation only)
EXPLAIN
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
-- Detailed execution plan with actual timing
EXPLAIN ANALYZE
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
-- Get plan in JSON format for programmatic analysis
EXPLAIN (FORMAT JSON, ANALYZE, BUFFERS)
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
Example EXPLAIN output with annotations:
EXPLAIN ANALYZE
SELECT c.customer_name, COUNT(o.order_id) AS order_count
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name;
# Output:
HashAggregate  (cost=219.27..221.27 rows=200 width=40) (actual time=0.968..1.001 rows=200 loops=1)
  # Groups by customer_name and computes count
  Group Key: c.customer_name
  ->  Hash Join  (cost=11.75..207.27 rows=1200 width=40) (actual time=0.208..0.771 rows=1200 loops=1)
        # Joins customer and order tables using hash
        Hash Cond: (o.customer_id = c.customer_id)
        ->  Seq Scan on orders o  (cost=0.00..127.00 rows=1200 width=8) (actual time=0.011..0.316 rows=1200 loops=1)
              # Full table scan on orders
        ->  Hash  (cost=8.00..8.00 rows=300 width=40) (actual time=0.153..0.153 rows=300 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 22kB
              ->  Seq Scan on customers c  (cost=0.00..8.00 rows=300 width=40) (actual time=0.009..0.077 rows=300 loops=1)
                    # Full table scan on customers
Planning Time: 0.273 ms
Execution Time: 1.050 ms
1. Index Only Scan - Retrieves data directly from index without table access
✓ Fastest possible scan when applicable
✓ Requires VACUUM to keep visibility map updated
2. Index Scan - Uses index to find rows, then fetches from table
✓ Good for retrieving small portion of table
✓ Efficient for high-selectivity conditions
3. Bitmap Index Scan - Builds bitmap of qualifying rows first
✓ Efficient for retrieving medium-sized result sets
✓ Can combine results from multiple indexes
4. Seq Scan - Reads entire table
✓ Best for retrieving large portions of a table
✓ When no useful index exists or optimizer decides full scan is faster
5. CTE Scan - Materializes a CTE before using it
✓ Can be beneficial for complex CTEs used multiple times
✓ But prevents some query optimizations
-- Issue: Missing join index
-- Solution: Create index on join column
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
-- Issue: Inefficient LIKE pattern
-- Solution: Use pg_trgm extension and GIN index
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_products_name_trgm ON products USING GIN (product_name gin_trgm_ops);
-- Now efficient: SELECT * FROM products WHERE product_name LIKE '%keyword%';
-- Issue: Full table scans on large tables
-- Solution: Add appropriate indexes and rewrite query
-- Before: WHERE EXTRACT(YEAR FROM date_column) = 2023
-- After: WHERE date_column BETWEEN '2023-01-01' AND '2023-12-31'
-- Issue: Slow aggregations
-- Solution: Use materialized views for frequently needed aggregates
CREATE MATERIALIZED VIEW monthly_sales AS
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS monthly_revenue
FROM
orders
GROUP BY
DATE_TRUNC('month', order_date);
CREATE UNIQUE INDEX idx_monthly_sales_month ON monthly_sales(month);
-- Refresh as needed: REFRESH MATERIALIZED VIEW monthly_sales;
- Read the entire question before starting
- Identify tables needed and their relationships
- List required columns in the output
- Determine aggregations/groupings needed
- Sketch the query structure on paper/comment
-- Start with a simple structure that returns SOMETHING
SELECT customer_id, order_date
FROM orders
WHERE order_date > '2023-01-01';
-- Add complexity incrementally
SELECT
c.customer_name,
COUNT(o.order_id) AS order_count,
SUM(o.total_amount) AS total_spent
FROM
customers c
JOIN
orders o ON c.customer_id = o.customer_id
WHERE
o.order_date > '2023-01-01'
GROUP BY
c.customer_name;
- Run queries against small subsets first
- Use LIMIT to check initial results
- Test edge cases: NULL values, empty result sets, duplicates
- Review joins to ensure proper cardinality
1. Clearly state the options
2. Technical pros and cons:
- Performance considerations
- Scalability implications
- Maintainability aspects
- Security considerations
3. Business pros and cons:
- Cost implications
- Time to market
- User experience impact
- Future flexibility
4. Recommendation with clear rationale
1. Problem Statement:
"We needed to reduce API latency by 50% while handling 10x more traffic"
2. Constraints:
- No additional hardware budget
- Must maintain backward compatibility
- 99.99% uptime requirement
3. Approach:
- Profiled application to identify bottlenecks
- Implemented multi-level caching strategy
- Optimized database queries and added indexes
- Introduced connection pooling
4. Results:
- Achieved 70% latency reduction
- Successfully handled 20x traffic spike during holiday season
- Reduced infrastructure costs by 15%
1. Executive Level (Why it matters):
"This architecture reduces our time-to-market for new features by 40%"
2. Product Management Level (What it enables):
"This approach allows us to deploy new features independently without system-wide testing"
3. Technical Peer Level (How it works):
"We're using a message queue with dead-letter handling to ensure reliable event processing"
4. Junior Engineer Level (Implementation details):
"Let me walk through the event flow and error handling patterns we're using..."
Always connect technical decisions to business outcomes:
- Revenue: "This optimization increased conversion rate by 0.5%, adding $1M in annual revenue"
- Cost: "Migrating to serverless reduced infrastructure costs by $250K annually"
- Time: "The new CI/CD pipeline reduced release cycle from 2 weeks to 2 days"
- Quality: "Automated testing reduced production bugs by 60%"
- Risk: "Multi-region deployment reduced our disaster recovery time from 4 hours to 5 minutes"