This document explains how the main components of the Hadoop-based Big Data ecosystem connect and work together: Hue, Hive, Impala, HDFS, Spark (PySpark), HBase, Iceberg, Parquet, ORC, Oozie, and Teradata.
| Component | Type | Purpose |
|---|---|---|

| Original Text (on page) | Issue | Suggested Correction |
|---|---|---|
| “List sroted pods” | Typo | “List sorted pods” |
| “List pods using a different output” | Wording unclear | Could be: “List pods with different output formats” |
| “View all cotainers logs…” | Typo | “View all containers logs…” |
| “locahost-port” | Typo | “localhost-port” |
| “hosts-port” | Wording | Better: “host-port” |

    flowchart TD
        %% =============================
        %% Global Layout Tweaks
        %% =============================
        %% Make arrows thicker and more visible
        linkStyle default stroke-width:2px,stroke:#555,opacity:0.9;

    -- CREATE EXTENSION IF NOT EXISTS postgis;
    -- CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector is installed under the name "vector"
    ------------------------------ Exercise 1 ------------------------------
    -- Table setup
    CREATE TABLE products (
        id SERIAL PRIMARY KEY,
        sku VARCHAR(20) UNIQUE,
        name VARCHAR(200),
        category VARCHAR(50),

| Search Type | Speed | Accuracy | Flexibility | Storage | Best For |
|---|---|---|---|---|---|
| Exact Match | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐⭐ | IDs, codes, filters |
| Pattern Match | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | Autocomplete, prefixes |
| Full-Text | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Documents, articles |
| Vector / Semantic | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | Recommendations, concepts |
| Fuzzy | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | Typos, data cleaning |
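
To make the trade-offs in the table above concrete, here is a minimal Python sketch contrasting exact, pattern (prefix), and fuzzy matching over a small in-memory list. The `NAMES` list, the function names, and the `0.3` threshold are illustrative assumptions, and the trigram similarity is only loosely modeled on how trigram indexes compare strings; it is not a database implementation.

```python
from __future__ import annotations

# Hypothetical product names used only for illustration.
NAMES = ["keyboard", "keychain", "monitor", "mouse"]


def trigrams(text: str) -> set[str]:
    """Return the 3-character windows of a lowercased, space-padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def exact_match(query: str) -> list[str]:
    """Exact match: the whole value must be identical."""
    return [name for name in NAMES if name == query]


def prefix_match(query: str) -> list[str]:
    """Pattern match restricted to prefixes, as used for autocomplete."""
    return [name for name in NAMES if name.startswith(query)]


def fuzzy_match(query: str, threshold: float = 0.3) -> list[str]:
    """Fuzzy match: keep names whose trigram similarity reaches the threshold."""
    return [name for name in NAMES if similarity(name, query) >= threshold]
```

With the misspelled query `"keybord"`, `exact_match` finds nothing, while `fuzzy_match` still recovers `"keyboard"`. The fuzzy path does more work per candidate, which is consistent with its lower speed rating in the table.
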

| Scenario | Best Choice | Alternative / Avoid |
|---|---|---|
| User login/auth | Exact Match | All others |
| Product SKU lookup | Exact Match | All others |
| Autocomplete | Pattern Match (prefix) | Fuzzy, Vector |
| Blog search | Full-Text | Vector + Full-Text, Pattern |
| Recommendation | Vector + Full-Text | Pattern |
| Exact data with typos | Fuzzy | Pattern, Exact |
| Multi-language content | Vector + Full-Text | Pattern |
| Real-time search | Exact / Pattern | Full-Text, Vector |

| Search Type | Additional Storage | Index Size | Notes |
|---|---|---|---|
| Exact Match | None | ~2–5% of data | B-tree indexes |
| Pattern Match | None | ~10–20% | GIN trigram indexes |
| Full-Text | ~20–50% | ~10–30% | tsvector + GIN index |
| Vector Search | ~50–200% | ~20–100% | Depends on dimensions |
| Fuzzy Search | None | ~10–20% | Uses trigram indexes |
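
The "Depends on dimensions" note for vector search can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes embeddings are stored as 32-bit floats (4 bytes per dimension) and ignores per-row and index overhead; the function names and the example figures are hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each dimension is stored as a 32-bit float


def vector_bytes(dimensions: int, rows: int) -> int:
    """Raw size of `rows` embeddings, ignoring per-row and index overhead."""
    return dimensions * BYTES_PER_FLOAT32 * rows


def overhead_percent(dimensions: int, avg_row_bytes: int) -> float:
    """Size of one embedding as a percentage of the average original row size."""
    return 100.0 * dimensions * BYTES_PER_FLOAT32 / avg_row_bytes
```

For example, a 384-dimensional embedding takes 1,536 bytes per row. Against a hypothetical average row of 2,000 bytes that is about 77% additional storage, which falls inside the range in the table; doubling the dimensions doubles the overhead.
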

| Analyzer | Description | Example Use |
|---|---|---|
| Standard | Default; breaks text by word boundaries, removes most punctuation, lowercases tokens. | English prose, general search |
| Simple | Splits on non-letter characters and lowercases tokens. | Part numbers, technical terms |
| Whitespace | Splits on whitespace only, preserves case. | Code, serial numbers |
| Keyword | Does not split; treats entire text as a single token. | Exact match fields, IDs, tags |
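
The behavior of the four analyzers can be approximated in a few lines of Python. This is a rough sketch based only on the descriptions in the table: the regular expressions are simplifications (the real standard analyzer follows Unicode word-boundary rules that a regex does not fully capture), and the function names are illustrative.

```python
from __future__ import annotations

import re


def standard(text: str) -> list[str]:
    """Approximation: split into runs of word characters, then lowercase."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def simple(text: str) -> list[str]:
    """Split on anything that is not a letter, then lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def whitespace(text: str) -> list[str]:
    """Split on whitespace only; case and punctuation are preserved."""
    return text.split()


def keyword(text: str) -> list[str]:
    """Do not split: the entire input is a single token."""
    return [text]
```

For the input `"Wi-Fi Router X200"`, `simple` drops the digits and yields `x` as the last token, `whitespace` keeps `Wi-Fi` intact with its original case, and `keyword` returns the whole string unchanged.
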

| Feature | TF-IDF | BM25 |
|---|---|---|
| Full Form | Term Frequency – Inverse Document Frequency | Best Matching 25 |
| Default in Elasticsearch | ❌ (was the default before v5.0) | ✅ (v5.0 and later) |
| Term Frequency Handling | Linear | Saturated (diminishing returns) |
| Document Length Normalization | Minimal | Tunable and robust |
| Tunable Parameters | No | Yes (k1, b) |
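
The difference between linear and saturated term-frequency handling is easier to see with the per-term formulas written out. The sketch below uses the textbook forms of both scores, with `k1 = 1.2` and `b = 0.75` as defaults; the function names are illustrative, and the IDF value is passed in rather than computed, so this is not a full ranking implementation.

```python
def tf_idf_term(tf: int, idf: float) -> float:
    """Textbook TF-IDF contribution: grows linearly with term frequency."""
    return tf * idf


def bm25_term(
    tf: int,
    idf: float,
    doc_len: int,
    avg_doc_len: float,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """BM25 contribution of one term in one document.

    k1 controls how quickly repeated occurrences stop adding score;
    b controls how strongly long documents are penalized.
    """
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

With `idf = 1.0` and a document of average length, ten occurrences score 10.0 under TF-IDF but stay below the BM25 ceiling of `k1 + 1 = 2.2`. Holding the term frequency fixed and doubling `doc_len` lowers the BM25 score, which is the length normalization the table refers to.
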

| Use Case | TF-IDF | BM25 |
|---|---|---|
| Simple scoring model | ✅ | ✅ |
| Accurate relevance for modern search | ❌ | ✅ |
| Normalize for document length | ❌ | ✅ |
| Tune scoring behavior with parameters | ❌ | ✅ |