Big Data product list and short descriptions
Sqoop : a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
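A minimal sketch of a typical import, shelling out to the sqoop CLI from Python (the connection URL, credentials, table, and HDFS target directory are all placeholders):

```python
# Hypothetical example: import a MySQL table into HDFS with Sqoop.
# Connection URL, credentials, table, and target directory are placeholders.
import subprocess

subprocess.check_call([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # source database (placeholder)
    "--username", "etl_user",
    "--table", "orders",                        # table to import
    "--target-dir", "/data/orders",             # HDFS destination
    "--num-mappers", "4",                       # parallel map tasks
])
```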
Spark : a fast and general engine for large-scale data processing. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Couchbase : an open source, distributed NoSQL document-oriented database. It exposes a fast key-value store with managed cache for sub-millisecond data operations, purpose-built indexers for fast queries, and a query engine for executing SQL-like queries.
Jupyter : a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Use case: data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and more.
H2O : an open source predictive analytics platform for data scientists and business analysts who need scalable and fast machine learning. Use case: advertising, fraud detection, predictive modeling, customer intelligence.
Tachyon : a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster frameworks.
Flink : an open source platform for distributed stream and batch data processing. Its core is a streaming dataflow engine.
Drill : a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. Drill has a storage plugin for Hive tables, so you can simply point Drill at the Hive metastore and start performing low-latency queries on Hive tables. In fact, a single Drill cluster can query data from multiple Hive metastores, and even perform joins across these datasets.
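A sketch against Drill's REST API (assumes a drillbit on localhost:8047; the Hive table path in the query is a placeholder):

```python
# Minimal sketch using Drill's REST endpoint to run ad hoc SQL.
import requests

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL",
          "query": "SELECT * FROM hive.`default`.orders LIMIT 10"},  # placeholder table
)
print(resp.json())  # result rows come back as JSON
```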
Solr : an open source enterprise search platform built on Apache Lucene, offering full-text search, faceting, and near real-time indexing. It powers some of the most heavily trafficked websites and applications in the world.
Elasticsearch : a search server based on Lucene. It provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java.
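Because the interface is plain REST over JSON, a sketch needs nothing beyond an HTTP client (node address, index name, and document are placeholders):

```python
# Index a schema-free JSON document, then run a full-text search.
import requests

requests.put("http://localhost:9200/products/doc/1",
             json={"name": "widget", "price": 9.99})

hits = requests.get("http://localhost:9200/products/_search",
                    params={"q": "widget"}).json()
print(hits["hits"]["total"])   # number of matching documents
```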
Hive : Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
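A sketch of querying Hive through HiveServer2 using the third-party PyHive package (an assumption; host, port, and table are placeholders):

```python
# Run a summarization query against Hive via HiveServer2 (assumes PyHive).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)   # HiveServer2 endpoint
cur = conn.cursor()
cur.execute("SELECT category, COUNT(*) FROM sales GROUP BY category")  # placeholder table
for row in cur.fetchall():
    print(row)
```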
Hadoop : an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Its core modules:
Hadoop MapReduce : a programming model for large-scale data processing (illustrated in the sketch after this module list).
Hadoop Distributed File System (HDFS) : a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN : a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications.
Hadoop Common : contains libraries and utilities needed by other Hadoop modules.
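As a minimal illustration of the MapReduce model above, here is word count expressed in Python as a map phase, a shuffle (group by key), and a reduce phase; on a real cluster Hadoop runs many mapper and reducer processes and performs the shuffle itself:

```python
# Word count in the MapReduce style: map, shuffle (group by key), reduce.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word, 1                      # emit (key, value) pairs

def reducer(word, counts):
    return word, sum(counts)               # aggregate all values per key

lines = ["big data", "big clusters"]       # stand-in for an input split
shuffled = defaultdict(list)
for line in lines:                         # map phase
    for key, value in mapper(line):
        shuffled[key].append(value)        # shuffle: group values by key
results = [reducer(k, v) for k, v in shuffled.items()]  # reduce phase
print(results)   # [('big', 2), ('data', 1), ('clusters', 1)]
```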
Apache Accumulo : is a sorted, distributed key/value store based on Google's BigTable design. It is built on top of Apache Hadoop, Apache ZooKeeper, and Apache Thrift. Written in Java, Accumulo has cell-level access labels and server-side programming mechanisms. Accumulo is the third most popular NoSQL wide column store according to the DB-Engines ranking.
Apache Cassandra : is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.
Avro : is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services.
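A small sketch of the JSON-defined schema and compact binary serialization, here using the third-party fastavro package (an assumption; the User schema is a made-up example):

```python
# Define an Avro schema in JSON, serialize records to a binary file,
# then read them back (assumes the fastavro package).
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age", "type": "int"}],
})

with open("users.avro", "wb") as out:          # serialize to compact binary
    writer(out, schema, [{"name": "Ann", "age": 34}])

with open("users.avro", "rb") as inp:          # deserialize
    for record in reader(inp):
        print(record)
```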
Apache Camel : is a rule-based routing and mediation engine that provides a Java object-based implementation of the Enterprise Integration Patterns, using an API (or declarative Java domain-specific language) to configure routing and mediation rules. The domain-specific language means that Apache Camel can support type-safe smart completion of routing rules in an integrated development environment using regular Java code, without large amounts of XML configuration files, though XML configuration inside Spring is also supported. Camel is often used with Apache ServiceMix, Apache ActiveMQ, and Apache CXF in service-oriented architecture infrastructure projects.
Apache Storm : is a distributed computation framework written predominantly in the Clojure programming language. A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG), with spouts and bolts acting as the graph vertices. Edges on the graph are named streams and direct data from one node to another. Together, the topology acts as a data transformation pipeline. At a superficial level the general topology structure is similar to a MapReduce job, with the main difference being that data is processed in real time as opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed, while a MapReduce job DAG must eventually end.
Apache Derby : (previously distributed as IBM Cloudscape) is a relational database management system (RDBMS) developed by the Apache Software Foundation that can be embedded in Java programs and used for online transaction processing. It has a 2.6 MB disk-space footprint
Apache ActiveMQ : is an open source message broker written in Java together with a full Java Message Service (JMS) client. It provides "Enterprise Features", which in this context means supporting communication between more than one client or server.
Apache Ambari : is a software project of the Apache Software Foundation aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
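A sketch of the RESTful API side: list the clusters an Ambari server manages (host, port, and the default admin credentials are assumptions):

```python
# Query Ambari's REST API for the clusters it manages.
import requests

resp = requests.get(
    "http://ambari-host:8080/api/v1/clusters",   # placeholder host
    auth=("admin", "admin"),                     # default credentials (assumption)
    headers={"X-Requested-By": "ambari"},        # header Ambari requires on write calls
)
print(resp.json())
```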
Apache Flume : is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Apache Phoenix : is an open source, massively parallel, relational database layer on top of NoSQL stores such as Apache HBase. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; upsert and delete rows singly and in bulk; and query data through SQL.
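A sketch using the python-phoenixdb package against the Phoenix Query Server (both the package choice and the localhost:8765 endpoint are assumptions; the table is made up):

```python
# SQL over HBase through Phoenix's query server (assumes python-phoenixdb).
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
cur.execute("UPSERT INTO users VALUES (1, 'Ann')")   # Phoenix upsert semantics
cur.execute("SELECT * FROM users")
print(cur.fetchall())
```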
Apache Kafka : is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.
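A minimal produce/consume sketch with the third-party kafka-python client (an assumption; broker address and topic are placeholders):

```python
# Append a message to a Kafka topic, then read it back.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clicks", b'{"page": "/home"}')   # append to the log
producer.flush()

consumer = KafkaConsumer("clicks",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")   # replay from the start
for msg in consumer:
    print(msg.value)
    break
```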
Apache Mahout : is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on the areas of collaborative filtering, clustering, and classification. Many of the implementations use the Apache Hadoop platform.
Maven : is a build automation tool used primarily for Java projects. The word maven means "accumulator of knowledge" in Yiddish. Maven addresses two aspects of building software: first, it describes how software is built, and second, it describes its dependencies. Contrary to preceding tools like Apache Ant, it uses conventions for the build procedure, and only exceptions need to be written down. An XML file describes the software project being built, its dependencies on other external modules and components, the build order, directories, and required plug-ins.
Apache ZooKeeper : is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems.
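A sketch of the naming/coordination registry using the third-party kazoo client (an assumption; the service path and payload are placeholders):

```python
# Register an ephemeral service node in ZooKeeper (assumes the kazoo client).
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/services/web")                    # shared configuration tree
zk.create("/services/web/node-", b"10.0.0.5:80",
          ephemeral=True, sequence=True)           # vanishes if this client dies
print(zk.get_children("/services/web"))            # service discovery
zk.stop()
```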
Oozie : is a workflow scheduler system to manage Hadoop jobs. It is a server-based Workflow Engine specialized in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs. Oozie is implemented as a Java web application that runs in a Java servlet container.
Pig : is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database systems. Pig Latin can be extended using UDFs (User Defined Functions), which the user can write in Java, Python, JavaScript, Ruby, or Groovy and then call directly from the language.
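A sketch of the Python UDF route mentioned above, following the pattern in Pig's documentation (the function, schema, and script are made-up examples; Pig runs such UDFs under Jython):

```python
# A Pig Python UDF: the decorator declares the return schema to Pig.
from pig_util import outputSchema

@outputSchema("upper_name:chararray")
def to_upper(name):
    return name.upper() if name else None

# In the Pig Latin script (illustrative):
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   names = LOAD 'people' AS (name:chararray);
#   up    = FOREACH names GENERATE myfuncs.to_upper(name);
```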
Apache PDFBox : is an open source pure-Java library that can be used to create, render, print, split, merge, alter, verify and extract text and metadata of PDF files.
OpenCV (Open Source Computer Vision) : is a library of programming functions mainly aimed at real-time computer vision, originally developed by Intel.
Apache Samza : is an open-source project developed by the Apache Software Foundation, written in Scala. The project aims to provide a near-realtime, asynchronous computational framework for stream processing.
ORC : Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
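A small sketch of writing and reading ORC from PySpark (paths are placeholders; assumes a Spark build with ORC support):

```python
# Write a DataFrame as ORC and read it back.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-demo").getOrCreate()
df = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "name"])
df.write.mode("overwrite").orc("/tmp/products_orc")   # columnar, compressed
print(spark.read.orc("/tmp/products_orc").count())
```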
RDD : Spark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. The fundamental programming abstraction is called Resilient Distributed Datasets (RDDs), a logical collection of data partitioned across machines. RDDs can be created by referencing datasets in external storage systems, or by applying coarse-grained transformations (e.g. map, filter, reduce, join) on existing RDDs.
The RDD abstraction is exposed through a language-integrated API in Java, Python, Scala, and R similar to local, in-process collections. This simplifies programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.
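A minimal PySpark sketch of the RDD API described above (the data is made up):

```python
# Build an RDD and chain the coarse-grained transformations mentioned above.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")
nums = sc.parallelize(range(10))            # logical collection, partitioned
evens = nums.filter(lambda n: n % 2 == 0)   # transformation (lazy)
squares = evens.map(lambda n: n * n)        # another lazy transformation
print(squares.reduce(lambda a, b: a + b))   # action triggers execution: 120
sc.stop()
```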
DataFrames : Spark SQL is a component on top of Spark Core that introduces a new data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language to manipulate DataFrames in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and ODBC/JDBC server. Prior to version 1.3 of Spark, DataFrames were referred to as SchemaRDDs.
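A short PySpark sketch of both the DataFrame DSL and the SQL path (assumes Spark 2.x, where SparkSession is the entry point; the data is made up):

```python
# Query the same data through the DataFrame DSL and through SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-demo").getOrCreate()
df = spark.createDataFrame([("Ann", 34), ("Bob", 29)], ["name", "age"])
df.filter(df.age > 30).show()               # DataFrame domain-specific language

df.createOrReplaceTempView("people")        # expose the data to SQL
spark.sql("SELECT name FROM people WHERE age > 30").show()
```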
Spark Streaming : Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, on a single engine.
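The classic socket word-count sketch, showing batch-style RDD operations applied to mini-batches (host and port are placeholders; feed it with e.g. `nc -lk 9999`):

```python
# Count words over 5-second mini-batches from a socket source.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-demo")
ssc = StreamingContext(sc, 5)                       # 5-second batches
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))    # same ops as batch code
counts.pprint()
ssc.start()
ssc.awaitTermination()
```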
Spark MLlib : is a distributed machine learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout.
GraphX : GraphX is a distributed graph processing framework on top of Spark. It provides an API for expressing graph computation that can model the Pregel abstraction. It also provides an optimized runtime for this abstraction.
D3 (Data Driven Documents) : D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.