@septicwolf818
Created November 27, 2024 14:14
Apache Projects - Big Data

Data Storage

  • Apache Hadoop - Foundation of many Big Data ecosystems; provides HDFS for distributed storage.
  • Apache HBase - Distributed, scalable, NoSQL database atop HDFS.
  • Apache Cassandra - Highly scalable NoSQL database for large data workloads.
  • Apache Accumulo - Secure, distributed key-value store.
  • Apache Kudu - Columnar storage for analytics.
  • Apache Parquet - Columnar storage file format optimized for Big Data.
  • Apache ORC - Optimized row-columnar file format for Big Data.
  • Apache Arrow - In-memory columnar data format for analytics.
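
Several entries above (Kudu, Parquet, ORC, Arrow) are described as columnar. As a rough illustration of what that means, here is a minimal pure-Python sketch that pivots row-oriented records into a column-oriented layout. It does not use any of the listed projects or their APIs; the sample records, field names, and the `to_columns` helper are hypothetical and exist only for this example.

```python
# Hypothetical sample records; field names are illustrative only.
rows = [
    {"user": "a", "clicks": 3, "country": "PL"},
    {"user": "b", "clicks": 5, "country": "DE"},
    {"user": "c", "clicks": 2, "country": "PL"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict mapping column name -> list of values."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads a single list instead of every full record.
total_clicks = sum(columns["clicks"])
```

In a column-oriented layout, a query that needs only `clicks` can skip the other fields entirely, which is the general reason columnar formats are favored for analytical workloads.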

Data Processing

  • Apache Spark - Unified analytics engine for large-scale data processing.
  • Apache Flink - Stream and batch processing engine.
  • Apache Beam - Unified model for stream and batch processing.
  • Apache Hive - SQL-like querying on Big Data.
  • Apache Pig - High-level platform for processing large datasets.
  • Apache Tez - Framework for DAG-based data processing workflows on Hadoop YARN.
  • Apache Storm - Real-time computation system.
  • Apache NiFi - Data integration and processing automation.

Data Management & Governance

  • Apache Ranger - Security and access control for Big Data.
  • Apache Atlas - Data governance and metadata management.
  • Apache Ambari - Management platform for Hadoop clusters.
  • Apache ZooKeeper - Coordination service for distributed systems.

Query Engines

  • Apache Drill - SQL query engine for heterogeneous data.
  • Apache Impala - Interactive SQL for Hadoop.
  • Apache Druid - Real-time analytics database.
  • Apache Kylin - OLAP engine for multi-dimensional analytics.
  • Apache Phoenix - SQL on HBase.

Streaming & Messaging

  • Apache Kafka - Distributed event streaming platform.
  • Apache Pulsar - Cloud-native distributed messaging and streaming.
  • Apache Samza - Stream processing framework.

Machine Learning & AI Integration

  • Apache Mahout - Scalable machine learning.
  • Apache MADlib - Machine learning library for SQL.
  • Apache SystemDS - Distributed machine learning.
  • Apache MXNet (in the Attic) - Deep learning framework.

Data Workflow Orchestration

  • Apache Airflow - Workflow automation and scheduling.
  • Apache DolphinScheduler - Distributed workflow orchestration.

Big Data Ecosystem Enhancements

  • Apache Oozie - Workflow scheduler for Hadoop.
  • Apache Bigtop - Packaging and integration of Big Data components.

Graph Processing

  • Apache TinkerPop - Graph computing framework.
  • Apache Giraph (in the Attic) - Large-scale graph processing.

Geospatial & Specialized Processing

  • Apache Sedona - Spatial data processing engine.
  • Apache Hudi - Data lake platform with upserts and incremental processing.