Skip to content

Instantly share code, notes, and snippets.

@alonsoir
Created November 15, 2024 07:45
Show Gist options
  • Save alonsoir/ca9afe7f1fc8332f6bb4eda2624ee80b to your computer and use it in GitHub Desktop.
Save alonsoir/ca9afe7f1fc8332f6bb4eda2624ee80b to your computer and use it in GitHub Desktop.
Visto en Linkedin, una refactorización de una consulta bastante grande.
---
Query original:
SELECT
u.user_id,u.username,u.email, COALESCE(SUM(o.total_amount), 0) AS total_ spent, COUNT(DISTINCT o. order_id) AS total_orders,
ROUND (AVG(o. total_amount, 2) AS avg_order_value,
MAXo. total amount) AS max order_value,
RANK() OVER (ORDER BY SUM(o. total_amount) DESC) AS spending_rank,
SELECT MAX(02. order_date)
FROM orders 02
WHERE02.userid=u.userid
) AS last_purchase date,
SELECT COUNT (DISTINCT 0i2. product id)
FROM orders 03
JOIN order_items oi2 ON 03. order_id = oi2. order_id
WHERE 03.user id = u.user id
AND 03 order date >= NOW() - INTERVAL '30 days'
) AS products_bought last 30 days,
SELECT p. category
FROM order items oi
JOIN products p ON oi. product id = p. product_ id
WHERE oi. order_id = 0.order_id
GROUP BY p. category
ORDER BY COUNT (*) DESC
LIMIT 1
) AS favorite category, lifetime_metrics.avg lifetime_value,
lifetime metrics.total lifetime value
FROM users u LEFT JOIN orders o ON u.user id = o.user_id LEFT JOIN order_items oi ON o. order_id = oi. order_ id
LEFT JOIN products p ON oi. product_id = p. product_ id LEFT JOIN [ SELECT user_id, SUM(total amount) AS total_lifetime value, ROUND (AVG (total amount), 2)
FROM orders
GROUP BY user id
• AS lifetime metrics ON u.user id = lifetime metrics.user_ id
GROUP BY
u.user_id, u.username, u.email, lifetime metrics.avg _lifetime value, lifetime metrics.total_lifetime value
ORDER BY
total_spent DESC, spending rank ASC;
No estoy seguro si la consulta es spark-sql o postgres, por lo que voy a poner ambas refactorizaciones.
--- SPARK-SQL
-- 1. Calcular las métricas de cada usuario a nivel de pedidos
WITH user_orders AS (
SELECT
u.user_id,
u.username,
u.email,
COALESCE(SUM(o.total_amount), 0) AS total_spent,
COUNT(DISTINCT o.order_id) AS total_orders,
ROUND(AVG(o.total_amount), 2) AS avg_order_value,
MAX(o.total_amount) AS max_order_value,
RANK() OVER (ORDER BY COALESCE(SUM(o.total_amount), 0) DESC) AS spending_rank
FROM
users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY
u.user_id, u.username, u.email
),
-- 2. Última fecha de compra por usuario
last_purchase AS (
SELECT
user_id,
MAX(order_date) AS last_purchase_date
FROM
orders
GROUP BY
user_id
),
-- 3. Productos comprados en los últimos 30 días
recent_products AS (
SELECT
o.user_id,
COUNT(DISTINCT oi.product_id) AS products_bought_last_30_days
FROM
orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE
o.order_date >= NOW() - INTERVAL '30 days'
GROUP BY
o.user_id
),
-- 4. Categoría favorita de cada usuario
favorite_category AS (
SELECT
o.user_id,
p.category AS favorite_category
FROM
orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY
o.user_id, p.category
ORDER BY
COUNT(*) DESC
LIMIT 1
),
-- 5. Métricas de por vida del usuario
lifetime_metrics AS (
SELECT
user_id,
SUM(total_amount) AS total_lifetime_value,
ROUND(AVG(total_amount), 2) AS avg_lifetime_value
FROM
orders
GROUP BY
user_id
)
-- 6. Combinar todos los resultados
SELECT
uo.user_id,
uo.username,
uo.email,
uo.total_spent,
uo.total_orders,
uo.avg_order_value,
uo.max_order_value,
uo.spending_rank,
lp.last_purchase_date,
rp.products_bought_last_30_days,
fc.favorite_category,
lm.avg_lifetime_value,
lm.total_lifetime_value
FROM
user_orders uo
LEFT JOIN last_purchase lp ON uo.user_id = lp.user_id
LEFT JOIN recent_products rp ON uo.user_id = rp.user_id
LEFT JOIN favorite_category fc ON uo.user_id = fc.user_id
LEFT JOIN lifetime_metrics lm ON uo.user_id = lm.user_id
ORDER BY
uo.total_spent DESC,
uo.spending_rank ASC;
---
Ventajas de esta refactorización
Legibilidad:
Cada parte de la lógica se encapsula en una CTE (Common Table Expression), lo que facilita entender el propósito de cada bloque.
Los nombres de las columnas y tablas son consistentes y descriptivos.
Modularidad:
Si surge un problema en algún cálculo, puedes depurar únicamente el CTE relacionado.
Rendimiento:
En entornos distribuidos como Spark SQL, esta estructura permite optimizar consultas intermedias gracias a la evaluación diferida de Spark.
Se eliminan redundancias, como recalcular valores varias veces.
Facilidad de mantenimiento:
Agregar o modificar un cálculo solo requiere cambiar un bloque específico en lugar de tocar toda la consulta.
Reutilización:
Algunos bloques (como recent_products o lifetime_metrics) pueden usarse en otras consultas.
¿Qué sigue?
Si esta consulta aún es demasiado lenta, podemos explorar:
Indexar las tablas (si no estás en Spark, donde los índices no aplican directamente).
Materializar ciertas métricas para cálculos frecuentes.
Revisar las estadísticas de cardinalidad de los datos.
--- PostgresQL
-- 1. Calcular las métricas de cada usuario a nivel de pedidos
WITH user_orders AS (
SELECT
u.user_id,
u.username,
u.email,
COALESCE(SUM(o.total_amount), 0) AS total_spent,
COUNT(DISTINCT o.order_id) AS total_orders,
ROUND(AVG(o.total_amount), 2) AS avg_order_value,
MAX(o.total_amount) AS max_order_value,
RANK() OVER (ORDER BY COALESCE(SUM(o.total_amount), 0) DESC) AS spending_rank
FROM
users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY
u.user_id, u.username, u.email
),
-- 2. Última fecha de compra por usuario
last_purchase AS (
SELECT
user_id,
MAX(order_date) AS last_purchase_date
FROM
orders
GROUP BY
user_id
),
-- 3. Productos comprados en los últimos 30 días
recent_products AS (
SELECT
o.user_id,
COUNT(DISTINCT oi.product_id) AS products_bought_last_30_days
FROM
orders o
JOIN order_items oi ON o.order_id = oi.order_id
WHERE
o.order_date >= NOW() - INTERVAL '30 days'
GROUP BY
o.user_id
),
-- 4. Categoría favorita de cada usuario
favorite_category AS (
SELECT
o.user_id,
p.category AS favorite_category
FROM
orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY
o.user_id, p.category
ORDER BY
o.user_id, COUNT(*) DESC
)
-- 5. Métricas de por vida del usuario
lifetime_metrics AS (
SELECT
user_id,
SUM(total_amount) AS total_lifetime_value,
ROUND(AVG(total_amount), 2) AS avg_lifetime_value
FROM
orders
GROUP BY
user_id
)
-- 6. Combinar todos los resultados
SELECT
uo.user_id,
uo.username,
uo.email,
uo.total_spent,
uo.total_orders,
uo.avg_order_value,
uo.max_order_value,
uo.spending_rank,
lp.last_purchase_date,
rp.products_bought_last_30_days,
fc.favorite_category,
lm.avg_lifetime_value,
lm.total_lifetime_value
FROM
user_orders uo
LEFT JOIN last_purchase lp ON uo.user_id = lp.user_id
LEFT JOIN recent_products rp ON uo.user_id = rp.user_id
LEFT JOIN favorite_category fc ON uo.user_id = fc.user_id
LEFT JOIN lifetime_metrics lm ON uo.user_id = lm.user_id
ORDER BY
uo.total_spent DESC,
uo.spending_rank ASC;
---
Explicación de cada CTE en PostgreSQL
user_orders: Esta CTE calcula las métricas principales de gastos del usuario (total, promedio y máximo de gasto, total de pedidos, y ranking de gasto).
last_purchase: Obtiene la última fecha de compra de cada usuario.
recent_products: Cuenta los productos únicos comprados por cada usuario en los últimos 30 días.
favorite_category: Determina la categoría favorita de cada usuario basada en la frecuencia de compra.
lifetime_metrics: Calcula las métricas de por vida del usuario, como el gasto total y el gasto promedio de por vida.
Consideraciones de Rendimiento
Para mejorar el rendimiento en PostgreSQL:
Indices: Asegúrate de que user_id, order_id y product_id tengan índices. Además, un índice en order_date podría optimizar la CTE recent_products.
Materializar resultados frecuentes: Si estas métricas se consultan regularmente, considera crear una vista materializada para la consulta completa o algunos CTEs.
Ajustes de VACUUM y ANALYZE: Ejecutar estos comandos con regularidad mejorará la eficiencia del planificador de consultas.
---
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment