@sany2k8
sany2k8 / Big Data Ecosystem Overview.md
Last active November 7, 2025 08:45
A comprehensive Markdown file that documents the end-to-end Big Data ecosystem workflow, including Hue, Hive, Impala, HDFS, Spark (PySpark), HBase, Iceberg, and file formats (Parquet, ORC).

🧭 Big Data Ecosystem Overview

This document explains how the main components of the Hadoop-based Big Data ecosystem connect and work together: Hue, Hive, Impala, HDFS, Spark (PySpark), HBase, Iceberg, Parquet, ORC, Oozie, and Teradata.


🧱 Core Components and Their Roles

Component Type Purpose

Typos & Corrections

| Original Text (on page) | Issue | Suggested Correction |
| --- | --- | --- |
| “List sroted pods” | Typo | “List sorted pods” |
| “List pods using a different output” | Wording unclear | Could be: “List pods with different output formats” |
| “View all cotainers logs…” | Typo | “View all containers logs…” |
| “locahost-port” | Typo | “localhost-port” |
| “hosts-port” | Wording | Better: “host-port” |

flowchart TD
%% =============================
%% Global Layout Tweaks
%% =============================
%% Make arrows thicker and more visible
linkStyle default stroke-width:2px,stroke:#555,opacity:0.9;
@sany2k8
sany2k8 / search_types.sql
Last active August 25, 2025 17:19
All the working queries
-- CREATE EXTENSION IF NOT EXISTS postgis;
-- CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector installs under the extension name "vector"
------------------------------ Exercise 1 ------------------------------
-- Table setup
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(20) UNIQUE,
name VARCHAR(200),
category VARCHAR(50),
Search Type Speed Accuracy Flexibility Storage Best For
Exact Match ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ IDs, codes, filters
Pattern Match ⭐⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐ Autocomplete, prefixes
Full-Text ⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐ Documents, articles
Vector / Semantic ⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐ Recommendations, concepts
Fuzzy ⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐⭐ Typos, data cleaning
Scenario Best Choice Alternative / Avoid
User login/auth Exact Match All others
Product SKU lookup Exact Match All others
Autocomplete Pattern Match (prefix) Fuzzy, Vector
Blog search Full-Text Vector + Full-Text, Pattern
Recommendation Vector + Full-Text Pattern
Exact Data with typos Fuzzy Pattern, Exact
Multi-language content Vector + Full-Text Pattern
Real-time search Exact / Pattern Full-Text, Vector

| Search Type | Additional Storage | Index Size | Notes |
| --- | --- | --- | --- |
| Exact Match | None | ~2–5% of data | B-tree indexes |
| Pattern Match | None | ~10–20% | GIN trigram indexes |
| Full-Text | ~20–50% | ~10–30% | tsvector + GIN index |
| Vector Search | ~50–200% | ~20–100% | Depends on dimensions |
| Fuzzy Search | None | ~10–20% | Uses trigram indexes |
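
The pattern-match and fuzzy rows above both lean on trigram indexes. As a rough illustration of why trigrams tolerate typos, here is a minimal Python sketch of trigram similarity. It is an approximation written for this note, not the `pg_trgm` implementation: the function names are hypothetical, words are split on whitespace only, and the padding rule (two leading spaces, one trailing space per word) is an assumption modelled on how `pg_trgm` is documented to behave.

```python
from __future__ import annotations


def trigrams(text: str) -> set[str]:
    """Return the set of character trigrams for each lowercased, padded word."""
    grams: set[str] = set()
    for word in text.lower().split():
        # Assumed padding: two spaces before and one after each word,
        # so that word starts and ends produce their own trigrams.
        padded = f"  {word} "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    # A transposition typo still shares the leading trigrams of the word.
    print(trigram_similarity("postgres", "postgers"))
    print(trigram_similarity("postgres", "postgres"))
```

Identical strings score 1.0 and strings with no shared trigrams score 0.0; a single transposition lands in between, which is the property a fuzzy-search threshold relies on.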

Types of Analyzers

| Analyzer | Description | Example Use |
| --- | --- | --- |
| Standard | Default; breaks text by word boundaries, removes most punctuation, lowercases tokens. | English prose, general search |
| Simple | Splits on non-letter, lowercases. | Part numbers, technical terms |
| Whitespace | Splits on whitespace only, preserves case. | Code, serial numbers |
| Keyword | Does not split; treats entire text as a single token. | Exact match fields, IDs, tags |

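
To make the differences between the four analyzers concrete, the following Python sketch imitates each one with a few lines of standard-library code. These are simplified stand-ins written for illustration, not the search engine's analyzers: the function names are hypothetical, and the Standard analyzer is approximated with a `\w+` regex, whereas the real one follows Unicode text segmentation rules and can differ on input such as version numbers.

```python
import re


def standard_analyzer(text):
    # Approximation: runs of word characters, lowercased.
    return [token.lower() for token in re.findall(r"\w+", text)]


def simple_analyzer(text):
    # Keep runs of letters only (no digits, no underscore), lowercased.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()


def keyword_analyzer(text):
    # No splitting: the whole input is a single token.
    return [text]


if __name__ == "__main__":
    sample = "Quick-Brown fox SN42"
    print(standard_analyzer(sample))    # ['quick', 'brown', 'fox', 'sn42']
    print(simple_analyzer(sample))      # ['quick', 'brown', 'fox', 'sn']
    print(whitespace_analyzer(sample))  # ['Quick-Brown', 'fox', 'SN42']
    print(keyword_analyzer(sample))     # ['Quick-Brown fox SN42']
```

Note that in this sketch the letters-only split drops the digits from `SN42`, so identifiers that mix letters and digits may be better served by the whitespace or keyword variants.
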

| Feature | TF-IDF | BM25 |
| --- | --- | --- |
| Full Form | Term Frequency – Inverse Document Frequency | Best Matching 25 |
| Default in Elasticsearch | ❌ (default only before v5.0) | ✅ (default since v5.0) |
| Term Frequency Handling | Linear | Saturated (diminishing returns) |
| Document Length Normalization | Minimal | Tunable and robust |
| Tunable Parameters | No | Yes (k1, b) |

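
The "linear" versus "saturated" distinction and the role of `k1` and `b` can be seen in the term-frequency part of each formula. The sketch below is a minimal Python illustration under stated assumptions: it shows only the term-frequency component (the IDF factor is omitted), uses the raw count for the textbook TF-IDF case, and takes `k1=1.2` and `b=0.75` as defaults. The function names are hypothetical.

```python
def tf_linear(tf):
    # Textbook TF-IDF term component: grows without bound as tf grows.
    return float(tf)


def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # BM25 term component: saturates toward (k1 + 1) as tf grows.
    # b controls how strongly long documents are penalised (b=0 disables it).
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


if __name__ == "__main__":
    for tf in (1, 2, 5, 20, 100):
        print(tf, tf_linear(tf), round(bm25_tf(tf, doc_len=100, avg_doc_len=100), 3))
```

With an average-length document, the BM25 component starts at 1.0 for a single occurrence and never exceeds `k1 + 1`, while the linear component keeps growing. Raising `k1` delays saturation; raising `b` makes the score more sensitive to document length.
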
Use Case TF-IDF BM25
Simple scoring model
Accurate relevance for modern search
Normalize for document length
Tune scoring behavior with parameters