This document explains how the main components of the Hadoop-based Big Data ecosystem connect and work together: Hue, Hive, Impala, HDFS, Spark (PySpark), HBase, Iceberg, Parquet, ORC, Oozie, and Teradata.
| Component | Type | Purpose |
|---|---|---|
| HDFS | Storage | Distributed file system for storing large datasets across cluster nodes. |
| Hive | SQL Engine | Batch SQL engine for data warehousing and ETL. |
| Impala | SQL Engine | Low-latency, interactive SQL engine for analytics. |
| Hue | Web UI | Web-based interface to interact with Hadoop ecosystem tools. |
| Spark / PySpark | Processing Engine | Distributed computing engine for data processing, ETL, ML, analytics. |
| Iceberg | Table Format | Modern table layer adding ACID transactions, schema evolution, time travel. |
| HBase | NoSQL Database | Real-time read/write access to data stored on HDFS. |
| Parquet / ORC | File Formats | Columnar storage formats optimized for big data analytics. |
| Oozie | Workflow Scheduler | Orchestrates and schedules Hadoop ecosystem jobs (Hive, Spark, etc.). |
| Teradata | Enterprise Data Warehouse | MPP SQL data warehouse for large-scale structured analytics. |
```
┌─────────────────────┐
│         Hue         │
│ (Web UI for Hadoop) │
│ - Run Hive/Impala   │
│ - Browse HDFS/HBase │
└──────────┬──────────┘
           │
          ┌┴───────────────────────┐
          │                        │
┌─────────┴──────────┐   ┌─────────┴─────────┐
│        Hive        │   │      Impala       │
│ (Batch SQL Engine) │   │ (Fast SQL Engine) │
│ - ETL/Warehousing  │   │ - Interactive BI  │
│ - Works with Oozie │   │                   │
└─────────┬──────────┘   └─────────┬─────────┘
          │                        │
          │ (Both use the same     │
          │  Hive Metastore)       │
          └────────────┬───────────┘
                       │
               ┌───────┴───────┐
               │    Iceberg    │
               │ Modern Table  │
               │    Format     │
               │ (ACID, schema │
               │  evolution)   │
               └───────┬───────┘
                       │
               ┌───────┴────────┐
               │      HDFS      │
               │ (Data Storage) │
               └───────┬────────┘
                       │
         ┌─────────────┴──────────────┐
         │                            │
┌────────┴─────────┐       ┌─────────┴─────────┐
│     PySpark      │       │       HBase       │
│ (Compute / ETL)  │       │ (NoSQL, Real-time)│
│ - Reads/Writes   │       │ - Random R/W      │
│   to HDFS/       │       │ - Built on HDFS   │
│   Iceberg/Hive   │       └─────────┬─────────┘
└──────────────────┘                 │
                      ┌──────────────┴───────────────┐
                      │           Teradata           │
                      │  Enterprise Data Warehouse   │
                      │ (Structured large-scale SQL) │
                      └──────────────────────────────┘
```
HDFS (Hadoop Distributed File System) is the foundation where all raw and processed data is stored.
- Stores data as blocks across cluster nodes.
- Provides fault tolerance and high throughput.
- Other components (Hive, Impala, Spark, HBase) read/write from it.
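For instance, Spark jobs address HDFS paths directly with the `hdfs://` URI scheme. A minimal PySpark sketch; the namenode address and paths are placeholders, not real cluster values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-io").getOrCreate()

# Read raw text from HDFS and write a staged copy back.
# "namenode:8020" and both paths are illustrative.
logs = spark.read.text("hdfs://namenode:8020/data/raw/app.log")
logs.write.mode("overwrite").text("hdfs://namenode:8020/data/staged/app_log/")
```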
Hive provides a SQL interface for data stored in HDFS.
- Converts queries into MapReduce / Tez / Spark jobs for batch processing.
- Ideal for ETL and large-scale aggregations.
- Defines table metadata in the Hive Metastore.
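One common way to run Hive DDL and queries from code is Spark's Hive integration. A minimal sketch, assuming a reachable Hive Metastore; the `sales` table is illustrative:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects Spark SQL to the Hive Metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Define a Hive-managed table and run a batch aggregation against it
spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) STORED AS ORC")
spark.sql("SELECT COUNT(*) AS n FROM sales").show()
```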
Impala is a low-latency, distributed SQL query engine.
- Reads the same tables defined in the Hive Metastore.
- Executes queries in memory for low-latency analytics.
- Best suited for BI dashboards and interactive analysis.
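From Python, Impala is commonly queried through the `impyla` package (DB-API style). A minimal sketch; the hostname and table are placeholders, and 21050 is Impala's default HiveServer2-protocol port:

```python
from impala.dbapi import connect  # pip install impyla

# Host and table names below are illustrative
conn = connect(host="impala-daemon-host", port=21050)
cur = conn.cursor()
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
for region, total in cur.fetchall():
    print(region, total)
```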
Spark is a distributed computation engine for batch, streaming, and ML workloads; PySpark provides a Python API for Spark.
- Can read/write from:
  - HDFS
  - Hive tables
  - Iceberg tables
  - Parquet/ORC files
- Often used for ETL, data transformation, or feature engineering (a minimal sketch follows below).
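A sketch of such an ETL step; the paths and the columns (`event_id`, `event_ts`, `event_date`) are placeholders for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Paths and column names are illustrative
raw = spark.read.parquet("hdfs://namenode:8020/data/raw/events/")
clean = (raw
         .dropDuplicates(["event_id"])     # remove duplicate events
         .filter("event_ts IS NOT NULL"))  # drop malformed rows

(clean.write
      .mode("overwrite")
      .partitionBy("event_date")           # partition output for faster scans
      .parquet("hdfs://namenode:8020/data/clean/events/"))
```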
Oozie orchestrates complex workflows and schedules recurring jobs in the Hadoop ecosystem.
- Coordinates jobs such as Hive scripts, Spark jobs, shell commands, and HDFS actions.
- Provides dependency management, error handling, and scheduling (time- or data-based triggers).
Apache Iceberg improves upon Hive's traditional table format.
Features:
- ACID transactions
- Schema evolution (add/remove columns safely)
- Hidden partitioning
- Time travel (query older snapshots)
Integration:
- Stored on HDFS or cloud storage.
- Accessible via Hive, Spark, Impala, Presto, etc.
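A minimal PySpark sketch, assuming Spark was launched with the Iceberg runtime jar and a catalog named `demo` configured; the catalog, database, and table names are all placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create an Iceberg table and append a couple of rows
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, name STRING) USING iceberg")
spark.createDataFrame([(1, "login"), (2, "click")], ["id", "name"]) \
     .writeTo("demo.db.events").append()

# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (country STRING)")

# Time travel: inspect the table's snapshot history via its metadata table
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
```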
| Feature | Parquet | ORC |
|---|---|---|
| Type | Columnar | Columnar |
| Developed By | Twitter & Cloudera | Hortonworks |
| Optimized For | Spark, Impala, Iceberg | Hive, Spark |
| Compression | Snappy, GZIP, LZO, ZSTD | ZLIB, Snappy |
| Read Performance | Excellent for Spark & Impala | Excellent for Hive |
| Write Performance | Faster | Slower, but better compression |
| Metadata | Stored per column chunk | Rich stats per stripe (min/max, bloom filters) |
| File Extension | .parquet | .orc |
Example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

# Read a CSV file, then persist it in both columnar formats
df = spark.read.csv("data.csv", header=True)
df.write.parquet("data_parquet/")
df.write.orc("data_orc/")
```
HBase is built on HDFS but optimized for random read/write access.
- Schema-flexible: stores data as key-value pairs.
- Suitable for:
  - IoT sensor data
  - Time-series logs
  - Real-time lookup tables
- Accessible via:
  - APIs (Java, Python, REST)
  - Apache Phoenix for SQL queries
  - Hue HBase Browser
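From Python, one common client is `happybase`, which talks to the HBase Thrift server. A minimal sketch; the host, table name, and column family are placeholders:

```python
import happybase  # pip install happybase; requires a running HBase Thrift server

connection = happybase.Connection("hbase-thrift-host")  # placeholder host
table = connection.table("sensor_readings")             # placeholder table

# HBase stores raw bytes; columns are addressed as b"family:qualifier"
table.put(b"sensor-42#2024-01-01T00:00", {b"d:temp": b"21.5", b"d:unit": b"C"})

row = table.row(b"sensor-42#2024-01-01T00:00")
print(row[b"d:temp"])                                   # b'21.5'
```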
Teradata is a highly scalable MPP (Massively Parallel Processing) SQL data warehouse for large-scale structured analytics.
- Often used as the enterprise "single source of truth" for curated, cleaned data.
- Works alongside Hadoop ecosystems:
  - Raw data is processed in Hadoop, then summarized and loaded into Teradata.
  - BI tools connect to Teradata for stable reporting.
- Architecture highlights:
  - Shared-nothing architecture
  - Parsing Engine (PE) for SQL queries
  - AMP (Access Module Processor) nodes for parallel data processing
  - BYNET network for high-speed inter-node communication
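One common hand-off is writing a curated Spark DataFrame into Teradata over JDBC. A minimal sketch; the URL, credentials, and table name are placeholders, and the Teradata JDBC driver jar must be on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("to-teradata").getOrCreate()
df = spark.createDataFrame([("EMEA", 1204.5)], ["region", "total"])  # stand-in for curated data

(df.write
   .format("jdbc")
   .option("url", "jdbc:teradata://teradata-host/DATABASE=analytics")  # placeholder host/db
   .option("dbtable", "curated_sales_summary")                         # placeholder table
   .option("user", "etl_user")
   .option("password", "********")
   .option("driver", "com.teradata.jdbc.TeraDriver")
   .mode("append")
   .save())
```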
Hue (Hadoop User Experience) provides a web-based UI for:
- Writing and running Hive or Impala SQL queries.
- Browsing and managing HDFS files.
- Viewing HBase tables.
- Managing Spark jobs and workflows (including Oozie workflows).
A typical end-to-end pipeline looks like this:
1. Data lands in HDFS from logs or ingestion pipelines.
2. Oozie triggers jobs on a schedule (e.g., nightly at 02:00).
3. PySpark cleans and transforms the data, then writes the output to Iceberg/Hive tables.
4. Results are stored in Parquet or ORC format.
5. Hive runs scheduled ETL aggregations.
6. Impala enables interactive analytics on that data.
7. Teradata receives curated, structured data for enterprise reporting.
8. HBase serves real-time lookups and time-series reads.
9. Hue is used by analysts and engineers to run queries, monitor jobs, browse data, and visualize results.
| Layer | Technology | Role |
|---|---|---|
| UI | Hue | Web-based UI for querying and data exploration |
| Schedule | Oozie | Orchestrates and schedules workflows in Hadoop |
| SQL Engines | Hive / Impala | Batch and interactive SQL processing |
| Processing | Spark / PySpark | Distributed data processing & ETL |
| Storage Format | Iceberg | Table management with ACID and schema evolution |
| File Formats | Parquet / ORC | Compressed columnar storage for analytics |
| Storage | HDFS | Distributed, fault-tolerant storage layer |
| NoSQL | HBase | Real-time read/write data store |
| DW | Teradata | Enterprise data warehouse for large-scale SQL analytics |
HDFS stores → Iceberg/Hive organize → Impala/Hive analyze → PySpark processes → HBase serves real-time → Teradata warehouses curated data → Oozie orchestrates the jobs → Hue visualizes and manages it all.