Is an open-source software framework for the storage and large-scale processing of datasets on clusters of commodity hardware.
- Scalability
- Reliability
Keeps all the data in raw format and uses a schema-on-read style.
- Hadoop Common: libraries and utilities used by the other Hadoop modules
- Hadoop Distributed File System (HDFS): distributed file system that stores data.
- Hadoop MapReduce: programming model for large-scale data processing (see the WordCount sketch after this list)
- Hadoop YARN: resource management platform responsible for managing compute resources in the cluster
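
To make the MapReduce programming model concrete, here is the canonical WordCount job in Java: the mapper emits (word, 1) pairs and the reducer sums them per word. Input and output paths come from the command line; this is a minimal sketch, not a tuned production job.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```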
Distributed, scalable, and portable file system written in Java for the Hadoop framework.
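
Applications usually reach HDFS through the Java FileSystem API. A minimal sketch, assuming a core-site.xml on the classpath points fs.defaultFS at the cluster (the path /tmp/hello.txt is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS from core-site.xml on the classpath
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      Path path = new Path("/tmp/hello.txt"); // hypothetical path

      // Write a small file (overwrite if it already exists)
      try (FSDataOutputStream out = fs.create(path, true)) {
        out.writeUTF("hello from HDFS");
      }

      // Read it back
      try (FSDataInputStream in = fs.open(path)) {
        System.out.println(in.readUTF());
      }
    }
  }
}
```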
Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
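
Sqoop is driven from the command line; a typical import pulls a table into HDFS over JDBC. A hedged example (the JDBC URL, credentials, table, and target directory are all hypothetical):

```sh
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /user/analyst/orders \
  --num-mappers 4
```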
Is a key component of the Hadoop stack
- Column-oriented database management system
- Key-value store (see the client sketch after this list)
- Based on Google Big Table
- Can hold extremely large data
- Dynamic data model
- Not a relational DBMS
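
A minimal sketch of the key-value style using the standard HBase Java client; the users table and its info column family are hypothetical and assumed to exist already:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

      // Put a cell: row key "row1", column family "info", qualifier "name"
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Get the row back by key
      Result result = table.get(new Get(Bytes.toBytes("row1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```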
It's a scripting language
- High-level programming on top of Hadoop MapReduce (see the sketch after this section)
- Pig Latin
- Data analysis problems as data flows
- Originally developed at Yahoo in 2006
UDF: user-defined functions
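
Pig Latin scripts can also be driven from Java through PigServer. A minimal word-count sketch, assuming local mode and a hypothetical input.txt (each registerQuery line is plain Pig Latin, expressing the analysis as a data flow):

```java
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // "local" runs Pig against the local filesystem; use "mapreduce" on a cluster
    PigServer pig = new PigServer("local");

    // Pig Latin: each statement defines one step in the data flow
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // Materialize the flow and write the results out
    pig.store("counts", "word_counts");
  }
}
```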
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
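
Hive exposes its SQL-like interface through HiveServer2, which plain JDBC can reach. A minimal sketch, assuming a hypothetical server on localhost:10000, a hypothetical visits table, and the Hive JDBC driver on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Register the driver explicitly (older JDKs/driver versions need this)
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC endpoint; host, port, and credentials are hypothetical
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM visits GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```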
- Workflow scheduler system to manage Apache Hadoop jobs
- Oozie Coordinator jobs: recurrent workflows triggered by time (frequency) and data availability
- Supports several types of Hadoop jobs: MapReduce, Pig, Hive, and Sqoop (see the client sketch after this list)
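
Workflows can be submitted programmatically with the Oozie Java client. A minimal sketch; the server URL and the HDFS application path are hypothetical, and a workflow.xml is assumed to be deployed there already:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieExample {
  public static void main(String[] args) throws Exception {
    // URL of the Oozie server (hypothetical host)
    OozieClient client = new OozieClient("http://oozie.example.com:11000/oozie");

    // Point the job at a deployed workflow application on HDFS (hypothetical path)
    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/alice/wordcount-wf");

    // Submit and start the workflow, then poll its status
    String jobId = client.run(conf);
    System.out.println(client.getJobInfo(jobId).getStatus());
  }
}
```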
- Provides operational services for a Hadoop cluster
- Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
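
A minimal sketch with the ZooKeeper Java client: store one piece of configuration under a znode and read it back. The ensemble address and znode name are hypothetical, and production code would wait for the connection event before issuing requests:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble (address is hypothetical); the lambda
    // is a no-op Watcher for connection events
    ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 3000, event -> {});

    // Store a piece of configuration under a znode
    zk.create("/app-config", "v1".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any client in the cluster can now read (and watch) it
    byte[] data = zk.getData("/app-config", false, null);
    System.out.println(new String(data));

    zk.close();
  }
}
```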
- Distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data.
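
A Flume agent is wired together in a properties file. A hedged sketch that tails an application log into HDFS through an in-memory channel (the agent, source, channel, and sink names and all paths are hypothetical):

```properties
# Name the components of this agent
agent.sources = logsrc
agent.channels = memch
agent.sinks = hdfssink

# Source: tail a local log file
agent.sources.logsrc.type = exec
agent.sources.logsrc.command = tail -F /var/log/app.log
agent.sources.logsrc.channels = memch

# Channel: buffer events in memory
agent.channels.memch.type = memory

# Sink: write events into date-partitioned HDFS directories
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = /flume/events/%Y-%m-%d
agent.sinks.hdfssink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfssink.channel = memch
```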
Is Cloudera's open-source massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster.
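
Impala speaks the HiveServer2 protocol, so the same Hive JDBC driver can query it, by default on port 21050. A hedged sketch (host, database, and table are hypothetical; auth=noSasl assumes an unsecured cluster):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaJdbcExample {
  public static void main(String[] args) throws Exception {
    // Impala's default HiveServer2-compatible port is 21050
    String url = "jdbc:hive2://impala.example.com:21050/default;auth=noSasl";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders")) {
      rs.next();
      System.out.println(rs.getLong(1));
    }
  }
}
```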
Is a fast and general engine for large-scale data processing
- Multi-stage in-memory primitives provide performance up to 100 times faster than MapReduce for certain applications
- Allows user programs to load data into a cluster's memory and query it repeatedly (see the sketch after this list)
- Well-suited to machine learning
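
A minimal Java sketch of the in-memory style: cache() pins the loaded data, so the second action reuses memory instead of re-reading the file. Here data.txt is a hypothetical input, and local[*] is just for a quick local run:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;

public class SparkCacheExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("cache-example")
        .master("local[*]") // for a quick local run; omit on a cluster
        .getOrCreate();

    // Load a text file and pin it in memory
    JavaRDD<String> lines = spark.read().textFile("data.txt").javaRDD().cache();

    // The first action reads from disk and populates the cache...
    System.out.println("total lines: " + lines.count());

    // ...subsequent actions reuse the in-memory data, which is why
    // iterative workloads such as machine learning benefit so much
    System.out.println("errors: " + lines.filter(l -> l.contains("ERROR")).count());

    spark.stop();
  }
}
```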