Md. Sany Ahmed sany2k8

🧭 Big Data Ecosystem Overview

This document explains how the main components of the Hadoop-based Big Data ecosystem connect and work together: Hue, Hive, Impala, HDFS, Spark (PySpark), HBase, Iceberg, Parquet, ORC, Oozie, and Teradata.

🧱 Core Components and Their Roles

Component	Type	Purpose

Typos & Corrections

Original Text (on page)	Issue	Suggested Correction
“List sroted pods”	Typo	“List sorted pods”
“List pods using a different output”	Wording unclear	Could be: “List pods with different output formats”
“View all cotainers logs…”	Typo	“View all containers' logs…”
“locahost-port”	Typo	“localhost-port”
“hosts-port”	Wording	Better: “host-port”

Search Type	Speed	Accuracy	Flexibility	Storage	Best For
Exact Match	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐	⭐⭐⭐⭐⭐	IDs, codes, filters
Pattern Match	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	Autocomplete, prefixes
Full-Text	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	Documents, articles
Vector / Semantic	⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐	Recommendations, concepts
Fuzzy	⭐⭐	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐	Typos, data cleaning

Scenario	Best Choice	Alternative / Avoid
User login/auth	Exact Match	All others
Product SKU lookup	Exact Match	All others
Autocomplete	Pattern Match (prefix)	Fuzzy, Vector
Blog search	Full-Text	Vector + Full-Text, Pattern
Recommendation	Vector + Full-Text	Pattern
Exact Data with typos	Fuzzy	Pattern, Exact
Multi-language content	Vector + Full-Text	Pattern
Real-time search	Exact / Pattern	Full-Text, Vector

Search Type	Additional Storage	Index Size	Notes
Exact Match	None	~2–5% of data	B-tree indexes
Pattern Match	None	~10–20%	GIN trigram indexes
Full-Text	~20–50%	~10–30%	tsvector + GIN index
Vector Search	~50–200%	~20–100%	Depends on dimensions
Fuzzy Search	None	~10–20%	Uses trigram indexes
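The trigram indexes mentioned for pattern and fuzzy search compare strings by the three-character fragments they share. The following is a minimal Python sketch of that idea, assuming padding and lowercasing rules similar to those of PostgreSQL's `pg_trgm` extension. The function names `trigrams` and `similarity` here are illustrative helpers, not the extension's implementation, and the word-splitting regex is a simplification.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of padded, lowercased trigrams for every word in text."""
    grams: set[str] = set()
    # Assumption: a "word" is a run of ASCII letters or digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Pad with two leading spaces and one trailing space so that
        # word starts and ends produce their own trigrams.
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(sorted(trigrams("cat")))
    print(similarity("postgres", "postgers"))  # transposed letters still overlap
    print(similarity("postgres", "mysql"))     # unrelated strings share nothing
```

A typo changes only a few trigrams, so the score stays well above that of an unrelated string. This is why one trigram index can serve both fuzzy matching and substring-style pattern matching.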

Types of Analyzers

Analyzer	Description	Example Use
Standard	Default; breaks text by word boundaries, removes most punctuation, lowercases tokens.	English prose, general search
Simple	Splits on any non-letter character (digits are discarded) and lowercases tokens.	Alphabetic technical terms
Whitespace	Splits on whitespace only, preserves case.	Code, serial numbers
Keyword	Does not split; treats entire text as a single token.	Exact match fields, IDs, tags
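To make the differences between these four analyzers concrete, here is a rough Python approximation of how each one tokenizes the same input. These are sketches only: the function names are hypothetical, and the real standard analyzer uses Unicode text segmentation rather than the simple `\w+` regex assumed here.

```python
import re


def standard_tokens(text: str) -> list[str]:
    # Approximation: word-character runs, lowercased.
    return [t.lower() for t in re.findall(r"\w+", text)]


def simple_tokens(text: str) -> list[str]:
    # Letters only: digits and punctuation act as separators and are dropped.
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]


def whitespace_tokens(text: str) -> list[str]:
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()


def keyword_tokens(text: str) -> list[str]:
    # The whole input is a single token.
    return [text]


if __name__ == "__main__":
    sample = "Part-No X100 is OK"
    for fn in (standard_tokens, simple_tokens, whitespace_tokens, keyword_tokens):
        print(f"{fn.__name__:18} {fn(sample)}")
```

In this sketch the simple variant turns `X100` into just `x`, which is why letter-only splitting is a poor fit for identifiers that contain digits, while the whitespace and keyword variants keep such values intact.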

Feature	TF-IDF	BM25
Full Form	Term Frequency – Inverse Document Frequency	Best Matching 25
Default in Elasticsearch	❌ (was the default before v5.0)	✅ (v5.0 and later)
Term Frequency Handling	Linear	Saturated (diminishing returns)
Document Length Normalization	Minimal	Tunable and robust
Tunable Parameters	No	Yes (`k1`, `b`)

Use Case	TF-IDF	BM25
Simple scoring model	✅	✅
Accurate relevance for modern search	❌	✅
Normalize for document length	❌	✅
Tune scoring behavior with parameters	❌	✅
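The "linear" versus "saturated" distinction is easier to see numerically. The sketch below scores a single term with a textbook TF-IDF variant (raw term frequency times `log(N/df)`) and with the common BM25 formula, using `k1 = 1.2` and `b = 0.75` as defaults. The function names and the corpus numbers are illustrative assumptions, and real engines differ in details such as length encoding.

```python
import math


def tfidf_term_score(tf: int, n_docs: int, doc_freq: int) -> float:
    """One simple TF-IDF variant: grows linearly with term frequency."""
    return tf * math.log(n_docs / doc_freq)


def bm25_idf(n_docs: int, doc_freq: int) -> float:
    """IDF form commonly used with BM25 (always positive)."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf: int, doc_len: int, avg_doc_len: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 score for one term in one document."""
    # b controls how strongly document length is normalized;
    # k1 controls how quickly repeated occurrences stop adding score.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return bm25_idf(n_docs, doc_freq) * tf * (k1 + 1) / (tf + length_norm)


if __name__ == "__main__":
    # Hypothetical corpus: 1000 documents, term appears in 50 of them.
    for tf in (1, 2, 10, 100):
        print(tf,
              round(tfidf_term_score(tf, 1000, 50), 3),
              round(bm25_term_score(tf, 100, 100, 1000, 50), 3))
```

As `tf` grows, the TF-IDF column keeps scaling while the BM25 column approaches the ceiling `idf * (k1 + 1)`. Raising `b` toward 1 penalizes longer documents more, and setting `b = 0` switches length normalization off.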

	flowchart TD

	%% =============================
	%% Global Layout Tweaks
	%% =============================

	%% Make arrows thicker and more visible
	linkStyle default stroke-width:2px,stroke:#555,opacity:0.9;

	-- CREATE EXTENSION IF NOT EXISTS postgis;
	-- CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector registers itself under the name "vector"

	------------------------------ Exercise 1 ------------------------------
	-- Table setup
	CREATE TABLE products (
	id SERIAL PRIMARY KEY,
	sku VARCHAR(20) UNIQUE,
	name VARCHAR(200),
	category VARCHAR(50),