Created
November 15, 2024 07:45
-
-
Save alonsoir/ca9afe7f1fc8332f6bb4eda2624ee80b to your computer and use it in GitHub Desktop.
Visto en Linkedin, una refactorización de una consulta bastante grande.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
Query original: | |
SELECT | |
u.user_id,u.username,u.email, COALESCE(SUM(o.total_amount), 0) AS total_ spent, COUNT(DISTINCT o. order_id) AS total_orders, | |
ROUND (AVG(o. total_amount, 2) AS avg_order_value, | |
MAXo. total amount) AS max order_value, | |
RANK() OVER (ORDER BY SUM(o. total_amount) DESC) AS spending_rank, | |
SELECT MAX(02. order_date) | |
FROM orders 02 | |
WHERE02.userid=u.userid | |
) AS last_purchase date, | |
SELECT COUNT (DISTINCT 0i2. product id) | |
FROM orders 03 | |
JOIN order_items oi2 ON 03. order_id = oi2. order_id | |
WHERE 03.user id = u.user id | |
AND 03 order date >= NOW() - INTERVAL '30 days' | |
) AS products_bought last 30 days, | |
SELECT p. category | |
FROM order items oi | |
JOIN products p ON oi. product id = p. product_ id | |
WHERE oi. order_id = 0.order_id | |
GROUP BY p. category | |
ORDER BY COUNT (*) DESC | |
LIMIT 1 | |
) AS favorite category, lifetime_metrics.avg lifetime_value, | |
lifetime metrics.total lifetime value | |
FROM users u LEFT JOIN orders o ON u.user id = o.user_id LEFT JOIN order_items oi ON o. order_id = oi. order_ id | |
LEFT JOIN products p ON oi. product_id = p. product_ id LEFT JOIN [ SELECT user_id, SUM(total amount) AS total_lifetime value, ROUND (AVG (total amount), 2) | |
FROM orders | |
GROUP BY user id | |
• AS lifetime metrics ON u.user id = lifetime metrics.user_ id | |
GROUP BY | |
u.user_id, u.username, u.email, lifetime metrics.avg _lifetime value, lifetime metrics.total_lifetime value | |
ORDER BY | |
total_spent DESC, spending rank ASC; | |
No estoy seguro si la consulta es spark-sql o postgres, por lo que voy a poner ambas refactorizaciones. | |
--- SPARK-SQL | |
-- 1. Calcular las métricas de cada usuario a nivel de pedidos | |
WITH user_orders AS ( | |
SELECT | |
u.user_id, | |
u.username, | |
u.email, | |
COALESCE(SUM(o.total_amount), 0) AS total_spent, | |
COUNT(DISTINCT o.order_id) AS total_orders, | |
ROUND(AVG(o.total_amount), 2) AS avg_order_value, | |
MAX(o.total_amount) AS max_order_value, | |
RANK() OVER (ORDER BY COALESCE(SUM(o.total_amount), 0) DESC) AS spending_rank | |
FROM | |
users u | |
LEFT JOIN orders o ON u.user_id = o.user_id | |
GROUP BY | |
u.user_id, u.username, u.email | |
), | |
-- 2. Última fecha de compra por usuario | |
last_purchase AS ( | |
SELECT | |
user_id, | |
MAX(order_date) AS last_purchase_date | |
FROM | |
orders | |
GROUP BY | |
user_id | |
), | |
-- 3. Productos comprados en los últimos 30 días | |
recent_products AS ( | |
SELECT | |
o.user_id, | |
COUNT(DISTINCT oi.product_id) AS products_bought_last_30_days | |
FROM | |
orders o | |
JOIN order_items oi ON o.order_id = oi.order_id | |
WHERE | |
o.order_date >= NOW() - INTERVAL '30 days' | |
GROUP BY | |
o.user_id | |
), | |
-- 4. Categoría favorita de cada usuario | |
favorite_category AS ( | |
SELECT | |
o.user_id, | |
p.category AS favorite_category | |
FROM | |
orders o | |
JOIN order_items oi ON o.order_id = oi.order_id | |
JOIN products p ON oi.product_id = p.product_id | |
GROUP BY | |
o.user_id, p.category | |
ORDER BY | |
COUNT(*) DESC | |
LIMIT 1 | |
), | |
-- 5. Métricas de por vida del usuario | |
lifetime_metrics AS ( | |
SELECT | |
user_id, | |
SUM(total_amount) AS total_lifetime_value, | |
ROUND(AVG(total_amount), 2) AS avg_lifetime_value | |
FROM | |
orders | |
GROUP BY | |
user_id | |
) | |
-- 6. Combinar todos los resultados | |
SELECT | |
uo.user_id, | |
uo.username, | |
uo.email, | |
uo.total_spent, | |
uo.total_orders, | |
uo.avg_order_value, | |
uo.max_order_value, | |
uo.spending_rank, | |
lp.last_purchase_date, | |
rp.products_bought_last_30_days, | |
fc.favorite_category, | |
lm.avg_lifetime_value, | |
lm.total_lifetime_value | |
FROM | |
user_orders uo | |
LEFT JOIN last_purchase lp ON uo.user_id = lp.user_id | |
LEFT JOIN recent_products rp ON uo.user_id = rp.user_id | |
LEFT JOIN favorite_category fc ON uo.user_id = fc.user_id | |
LEFT JOIN lifetime_metrics lm ON uo.user_id = lm.user_id | |
ORDER BY | |
uo.total_spent DESC, | |
uo.spending_rank ASC; | |
--- | |
Ventajas de esta refactorización | |
Legibilidad: | |
Cada parte de la lógica se encapsula en una CTE (Common Table Expression), lo que facilita entender el propósito de cada bloque. | |
Los nombres de las columnas y tablas son consistentes y descriptivos. | |
Modularidad: | |
Si surge un problema en algún cálculo, puedes depurar únicamente el CTE relacionado. | |
Rendimiento: | |
En entornos distribuidos como Spark SQL, esta estructura permite optimizar consultas intermedias gracias a la evaluación diferida de Spark. | |
Se eliminan redundancias, como recalcular valores varias veces. | |
Facilidad de mantenimiento: | |
Agregar o modificar un cálculo solo requiere cambiar un bloque específico en lugar de tocar toda la consulta. | |
Reutilización: | |
Algunos bloques (como recent_products o lifetime_metrics) pueden usarse en otras consultas. | |
¿Qué sigue? | |
Si esta consulta aún es demasiado lenta, podemos explorar: | |
Indexar las tablas (si no estás en Spark, donde los índices no aplican directamente). | |
Materializar ciertas métricas para cálculos frecuentes. | |
Revisar las estadísticas de cardinalidad de los datos. | |
--- PostgresQL | |
-- 1. Calcular las métricas de cada usuario a nivel de pedidos | |
WITH user_orders AS ( | |
SELECT | |
u.user_id, | |
u.username, | |
u.email, | |
COALESCE(SUM(o.total_amount), 0) AS total_spent, | |
COUNT(DISTINCT o.order_id) AS total_orders, | |
ROUND(AVG(o.total_amount), 2) AS avg_order_value, | |
MAX(o.total_amount) AS max_order_value, | |
RANK() OVER (ORDER BY COALESCE(SUM(o.total_amount), 0) DESC) AS spending_rank | |
FROM | |
users u | |
LEFT JOIN orders o ON u.user_id = o.user_id | |
GROUP BY | |
u.user_id, u.username, u.email | |
), | |
-- 2. Última fecha de compra por usuario | |
last_purchase AS ( | |
SELECT | |
user_id, | |
MAX(order_date) AS last_purchase_date | |
FROM | |
orders | |
GROUP BY | |
user_id | |
), | |
-- 3. Productos comprados en los últimos 30 días | |
recent_products AS ( | |
SELECT | |
o.user_id, | |
COUNT(DISTINCT oi.product_id) AS products_bought_last_30_days | |
FROM | |
orders o | |
JOIN order_items oi ON o.order_id = oi.order_id | |
WHERE | |
o.order_date >= NOW() - INTERVAL '30 days' | |
GROUP BY | |
o.user_id | |
), | |
-- 4. Categoría favorita de cada usuario | |
favorite_category AS ( | |
SELECT | |
o.user_id, | |
p.category AS favorite_category | |
FROM | |
orders o | |
JOIN order_items oi ON o.order_id = oi.order_id | |
JOIN products p ON oi.product_id = p.product_id | |
GROUP BY | |
o.user_id, p.category | |
ORDER BY | |
o.user_id, COUNT(*) DESC | |
) | |
-- 5. Métricas de por vida del usuario | |
lifetime_metrics AS ( | |
SELECT | |
user_id, | |
SUM(total_amount) AS total_lifetime_value, | |
ROUND(AVG(total_amount), 2) AS avg_lifetime_value | |
FROM | |
orders | |
GROUP BY | |
user_id | |
) | |
-- 6. Combinar todos los resultados | |
SELECT | |
uo.user_id, | |
uo.username, | |
uo.email, | |
uo.total_spent, | |
uo.total_orders, | |
uo.avg_order_value, | |
uo.max_order_value, | |
uo.spending_rank, | |
lp.last_purchase_date, | |
rp.products_bought_last_30_days, | |
fc.favorite_category, | |
lm.avg_lifetime_value, | |
lm.total_lifetime_value | |
FROM | |
user_orders uo | |
LEFT JOIN last_purchase lp ON uo.user_id = lp.user_id | |
LEFT JOIN recent_products rp ON uo.user_id = rp.user_id | |
LEFT JOIN favorite_category fc ON uo.user_id = fc.user_id | |
LEFT JOIN lifetime_metrics lm ON uo.user_id = lm.user_id | |
ORDER BY | |
uo.total_spent DESC, | |
uo.spending_rank ASC; | |
--- | |
Explicación de cada CTE en PostgreSQL | |
user_orders: Esta CTE calcula las métricas principales de gastos del usuario (total, promedio y máximo de gasto, total de pedidos, y ranking de gasto). | |
last_purchase: Obtiene la última fecha de compra de cada usuario. | |
recent_products: Cuenta los productos únicos comprados por cada usuario en los últimos 30 días. | |
favorite_category: Determina la categoría favorita de cada usuario basada en la frecuencia de compra. | |
lifetime_metrics: Calcula las métricas de por vida del usuario, como el gasto total y el gasto promedio de por vida. | |
Consideraciones de Rendimiento | |
Para mejorar el rendimiento en PostgreSQL: | |
Indices: Asegúrate de que user_id, order_id y product_id tengan índices. Además, un índice en order_date podría optimizar la CTE recent_products. | |
Materializar resultados frecuentes: Si estas métricas se consultan regularmente, considera crear una vista materializada para la consulta completa o algunos CTEs. | |
Ajustes de VACUUM y ANALYZE: Ejecutar estos comandos con regularidad mejorará la eficiencia del planificador de consultas. | |
--- |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment