@oleg-koval
Last active January 31, 2018 11:58
big data technologies list

HORTONWORKS DATA PLATFORM

HDP is an enterprise-oriented open source Apache™ Hadoop® distribution based on a centralized architecture (YARN). Hortonworks positions it as the industry's only truly secure distribution of this kind. HDP addresses the needs of data-at-rest, powers real-time customer applications, and delivers big data analytics that accelerate decision making and innovation.

YARN

(Not the same as Yarn, the package manager for Node.js.)

YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.

YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run IN Hadoop.

HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is now an Apache Hadoop subproject. The project URL is http://hadoop.apache.org/hdfs/.

Ambari

The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

Ranger

Comprehensive security for Enterprise Hadoop

Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides a centralised platform to define, administer and manage security policies consistently across Hadoop components. Apache Ranger offers a centralised security framework to manage fine-grained access control across:

  • Apache Hadoop HDFS
  • Apache Hive
  • Apache HBase
  • Apache Storm
  • Apache Knox
  • Apache Solr
  • Apache Kafka
  • Apache NiFi
  • YARN

ETL

ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:

extracting the data from source systems (SAP, ERP, other operational systems): data from the different source systems is converted into one consolidated data warehouse format, ready for transformation processing.

transforming the data may involve the following tasks:

  • applying business rules (so-called derivations, e.g., calculating new measures and dimensions),
  • cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.),
  • filtering (e.g., selecting only certain columns to load),
  • splitting a column into multiple columns and vice versa,
  • joining together data from multiple sources (e.g., lookup, merge),
  • transposing rows and columns,
  • applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing)

loading the data into a data warehouse, a data repository, or other reporting applications
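
The tasks above can be sketched as a small in-memory pipeline. This is a minimal illustration, not a production ETL tool: the source rows, the column names, and the helper functions (`extract`, `is_valid`, `transform`, `run_etl`) are all hypothetical, and a plain Python list stands in for the data warehouse.

```python
# Hypothetical source rows; the column names are illustrative only.
SOURCE_ROWS = [
    {"id": 1, "name": "Ann Lee", "gender": "Female", "amount": None, "note": "x"},
    {"id": 2, "name": "Bob Ray", "gender": "Male", "amount": 12.5, "note": "y"},
    {"id": None, "name": None, "gender": None, "amount": 3.0, "note": "z"},
]

GENDER_CODES = {"Male": "M", "Female": "F"}
LOAD_COLUMNS = ("id", "first_name", "last_name", "gender", "amount")


def extract():
    """Extract: return copies of the rows held by the (in-memory) source."""
    return [dict(row) for row in SOURCE_ROWS]


def is_valid(row):
    """Validation: reject a row whose first three columns are all empty."""
    first_three = list(row.values())[:3]
    return any(value is not None for value in first_three)


def transform(row):
    """Transform: split a column, clean values, and filter columns."""
    # Splitting: one "name" column becomes first_name and last_name.
    first_name, _, last_name = (row["name"] or "").partition(" ")
    cleaned = {
        "id": row["id"],
        "first_name": first_name,
        "last_name": last_name,
        # Cleaning: map "Male" to "M" and "Female" to "F".
        "gender": GENDER_CODES.get(row["gender"], row["gender"]),
        # Cleaning: map NULL (None) to 0.
        "amount": 0 if row["amount"] is None else row["amount"],
    }
    # Filtering: keep only the columns selected for loading.
    return {column: cleaned[column] for column in LOAD_COLUMNS}


def run_etl():
    """Load: append every valid, transformed row to the stand-in warehouse."""
    warehouse = []
    for row in extract():
        if is_valid(row):
            warehouse.append(transform(row))
    return warehouse


if __name__ == "__main__":
    for loaded_row in run_etl():
        print(loaded_row)
```

With the sample rows, the third row is rejected by the validation rule, and the `note` column is dropped by the filtering step.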

Spark

Spark is a fast and general engine for large-scale data processing. It fills a role similar to MapReduce and is sometimes compared with AWS Lambda:

AWS Lambda vs Apache Spark 2018 Comparison | StackShare

Kafka

Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.

Flink

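Apache Flink is a stream-processing framework. It is listed here for its connector for Kafka, which lets Flink jobs read from and write to Kafka topics.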

Airflow scheduler

The Airflow scheduler monitors all tasks and all DAGs (directed acyclic graphs), and triggers the task instances whose dependencies have been met. Behind the scenes, it monitors and stays in sync with a folder for all DAG objects it may contain, and periodically (every minute or so) inspects active tasks to see whether they can be triggered.
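
The rule "trigger the task instances whose dependencies have been met" can be sketched in plain Python. This is a toy model and does not use the Airflow API: the `DAG` dictionary, the task names, and the functions `ready_tasks` and `run_scheduler` are hypothetical, and every triggered task is assumed to succeed immediately.

```python
# Hypothetical DAG: each task maps to the set of upstream tasks it depends on.
DAG = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}


def ready_tasks(dag, done):
    """Return the unfinished tasks whose upstream tasks have all completed."""
    return sorted(
        task
        for task, upstream in dag.items()
        if task not in done and upstream <= done
    )


def run_scheduler(dag):
    """Repeat scheduler passes, triggering ready tasks until all are done."""
    done = set()
    passes = []
    while len(done) < len(dag):
        batch = ready_tasks(dag, done)
        if not batch:
            # Nothing can run, so the graph has a cycle or a missing task.
            raise ValueError("cycle or unmet dependency in DAG")
        passes.append(batch)
        done.update(batch)  # assumption: triggered tasks succeed at once
    return passes


if __name__ == "__main__":
    for number, batch in enumerate(run_scheduler(DAG), start=1):
        print(f"pass {number}: trigger {batch}")
```

Each pass corresponds loosely to one periodic inspection: `transform` and `validate` become eligible only after `extract` has finished, and `load` waits for both.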

Openshift

OpenShift is a container application platform that brings Docker and Kubernetes to the enterprise.

OpenShift includes Kubernetes for container orchestration and management. OpenShift adds developer and operations-centric tools that enable:

  • Rapid application development
  • Easy deployment and scaling
  • Long-term life-cycle maintenance for teams and applications

Kubernetes

Real production apps span multiple containers. Those containers must be deployed across multiple server hosts. Kubernetes gives you the orchestration and management capabilities required to deploy containers, at scale, for these workloads. Kubernetes orchestration allows you to build application services that span multiple containers, schedule those containers across a cluster, scale those containers, and manage the health of those containers over time.

Kubernetes also needs to integrate with networking, storage, security, telemetry and other services to provide a comprehensive container infrastructure.
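
To make "schedule those containers across a cluster" and "manage the health of those containers" concrete, here is a toy placement sketch in plain Python. It is not how the real Kubernetes scheduler works and does not call any Kubernetes API: the node names, the replica names, and the functions `schedule` and `reschedule_after_failure` are hypothetical, and "least loaded" is simply the node holding the fewest containers.

```python
def schedule(replicas, nodes):
    """Place each container replica on the node currently holding the fewest."""
    placement = {node: [] for node in nodes}
    for replica in replicas:
        target = min(placement, key=lambda node: len(placement[node]))
        placement[target].append(replica)
    return placement


def reschedule_after_failure(placement, failed_node):
    """Move containers from a failed node onto the remaining healthy nodes."""
    orphans = placement[failed_node]
    healthy = {
        node: list(containers)
        for node, containers in placement.items()
        if node != failed_node
    }
    for replica in orphans:
        target = min(healthy, key=lambda node: len(healthy[node]))
        healthy[target].append(replica)
    return healthy


if __name__ == "__main__":
    placement = schedule(
        ["web-1", "web-2", "web-3", "web-4"],
        ["node-a", "node-b", "node-c"],
    )
    print(placement)
    print(reschedule_after_failure(placement, "node-a"))
```

The second function mimics the health-management idea: when a node disappears, its containers are placed again so the application keeps its replica count.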

LDAP

The Lightweight Directory Access Protocol (LDAP; /ˈɛldæp/) is an open, vendor-neutral, industry-standard application protocol for accessing and maintaining distributed directory information services over an Internet Protocol (IP) network.

Kerberos

Kerberos is a computer network authentication protocol that uses tickets to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.

PAM

Linux Pluggable Authentication Modules (PAM) provide dynamic authentication support for applications and services in a Linux or GNU/kFreeBSD system.
