Track Awesome Hadoop Updates Daily

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources

😺 youngwookim/awesome-hadoop · ⭐ 1K · 🏷️ Big Data

Jan 24, 2022

Realtime Data Processing

Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing. It supports exactly-once stream processing.

Packaging, Provisioning and Monitoring

Logit.io - Send logs from Hadoop to Elasticsearch for monitoring and alerting.

Jan 14, 2019

Hadoop

Apache Hadoop Ozone - An Object Store for Apache Hadoop

Realtime Data Processing

Apache Druid (incubating) - A high-performance, column-oriented, distributed data store.

Machine learning and Big Data analytics

Apache Hivemall (incubating) - Apache Hivemall is a scalable machine learning library that runs on Apache Hive, Spark and Pig.

Apr 12, 2018

Data Management

Confluent Schema registry for Kafka (⭐1.8k) - Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving Avro schemas.

Hortonworks Schema Registry (⭐216) - Schema Registry is a framework to build metadata repositories.

Libraries and Tools

Schema Registry UI (⭐398) - Web tool for the Confluent Schema Registry to create, view, search, evolve, configure, and view the history of the Avro schemas of your Kafka cluster.

Misc.

Flume Plugins

Dec 11, 2017

SQL on Hadoop

Apache Impala - Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Data Management

Apache Kudu - Kudu provides a combination of fast inserts/updates and efficient columnar scans to enable multiple real-time analytic workloads across a single storage layer, complementing HDFS and Apache HBase.

Libraries and Tools

snakebite - A pure Python HDFS client

Apache Parquet - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Apache Superset (incubating) - Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

Realtime Data Processing

Apache Pulsar (incubating) - Apache Pulsar (incubating) is a highly scalable, low latency messaging platform running on commodity hardware. It provides simple pub-sub semantics over topics, guaranteed at-least-once delivery of messages, automatic cursor management for subscribers, and cross-datacenter replication.

Distributed Computing and Programming

Apache Livy (incubating) - Apache Livy (incubating) is a web service that exposes a REST interface for managing long-running Apache Spark contexts in your cluster. With Livy, new applications can be built on top of Apache Spark that require fine-grained interaction with many Spark contexts.

Search

Apache Solr - Apache Solr is an open source search platform built upon a Java library called Lucene.

Machine learning and Big Data analytics

BigDL - BigDL is a distributed deep learning library for Apache Spark; with BigDL, users can write their deep learning applications as standard Spark programs, which can directly run on top of existing Spark or Hadoop clusters.

Hadoop and Big Data Events

DataWorks Summit

Spark Summit

Jul 21, 2016

Websites

How to monitor Hadoop metrics

Jun 02, 2016

SQL on Hadoop

Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL

Apache HAWQ (incubating) - Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop

Apache Trafodion

Workflow, Lifecycle and Governance

Apache Airflow (⭐28k) - Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines

Machine learning and Big Data analytics

Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

Jun 01, 2016

SQL on Hadoop

Apache Drill - Schema-free SQL Query Engine

Feb 23, 2016

Hadoop

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

Distributed Computing and Programming

Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.

Hadoop and Big Data Events

ApacheCon

Strata + Hadoop World

Nov 14, 2015

Hadoop

Elasticsearch Hadoop (⭐1.9k) - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.

Apache Ignite - Distributed in-memory platform

YARN

mpich2-yarn (⭐110) - Running MPICH2 on YARN

SQL on Hadoop

Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.

Data Management

Apache Calcite - A Dynamic Data Management Framework

Workflow, Lifecycle and Governance

Apache Falcon - Data management and processing platform

Apache NiFi - A dataflow system

DSL

varaha (⭐51) - Machine learning and natural language processing with Apache Pig

Libraries and Tools

Elephant Bird (⭐1.1k) - Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.

Realtime Data Processing

Apache Storm

Apache Samza

Distributed Computing and Programming

SparkHub - A community site for Apache Spark

Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.

Search

Elasticsearch

Machine learning and Big Data analytics

Apache Lens

Oct 18, 2015

Search Engine Framework

Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.

Websites

Hadoop360

Sep 09, 2015

Libraries and Tools

Oozie Eclipse Plugin - A graphical editor for editing Apache Oozie workflows inside Eclipse.

Jul 28, 2015

Machine learning and Big Data analytics

RHadoop (⭐762) - Includes RHDFS, RHBase, RMR2, and plyrmr

Jul 27, 2015

NoSQL

Apache Phoenix - A SQL skin over HBase supporting secondary indices

SQL on Hadoop

Apache Phoenix - A SQL skin over HBase supporting secondary indices

Lingual - SQL interface for Cascading (MR/Tez job generator)

Data Management

Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

Luigi - Python package that helps you build complex pipelines of batch jobs

Realtime Data Processing

Apache Spark

Jul 23, 2015

Distributed Computing and Programming

Spark Packages - A community index of packages for Apache Spark

Jul 09, 2015

Benchmark

YCSB (⭐4.3k) - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Jul 08, 2015

Workflow, Lifecycle and Governance

Apache Oozie - Apache Oozie

Jul 01, 2015

Data Ingestion and Integration

Gobblin from LinkedIn (⭐2.1k) - Universal data ingestion framework for Hadoop

Jun 29, 2015

Machine learning and Big Data analytics

Oryx 2 (⭐1.8k) - Lambda architecture on Spark and Kafka for real-time, large-scale machine learning

Misc.

Hive Plugins

Storage Handler

Libraries and tools
- https://github.com/forward3d/rbhive (⭐98)
- https://github.com/synctree/activerecord-hive-adapter (⭐5)
- https://github.com/hrp/sequel-hive-adapter (⭐5)
- https://github.com/forward/node-hive (⭐61)
- https://github.com/recruitcojp/WebHive (⭐18)
- shib (⭐194) - WebUI for query engines: Hive and Presto
- https://github.com/dmorel/Thrift-API-HiveClient2 (⭐0) (Perl - HiveServer2)
- PyHive (⭐1.6k) - Python interface to Hive and Presto
- https://github.com/recruitcojp/OdbcHive (⭐7)
- HiveRunner (⭐244) - An open source unit test framework for Hadoop Hive queries based on JUnit4
- Beetest (⭐71) - A super simple utility for testing Apache Hive scripts locally for non-Java developers.
- Hive_test (⭐66) - Unit test framework for Hive and hive-service

Jun 18, 2015

Hadoop

Apache Kylin - Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets

May 15, 2015

Machine learning and Big Data analytics

Apache Mahout

Apr 28, 2015

Libraries and Tools

Apache Zeppelin - A web-based notebook that enables interactive data analytics

Jan 29, 2015

Hadoop

Crunch (⭐207) - Go-based toolkit for ETL and feature extraction on Hadoop

SQL on Hadoop

Apache Tajo - Data warehouse system for Apache Hadoop

Libraries and Tools

hdfs (⭐1.2k) - A native Go client for HDFS

Packaging, Provisioning and Monitoring

inviso (⭐200) - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Search

Banana (⭐667) - Kibana port for Apache Solr

Security

Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.

Apache Sentry - An authorization module for Hadoop

Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Websites

AWS BigData Blog

Dec 30, 2014

Books

Hadoop in Practice, Second Edition

Hadoop in Action, Second Edition

Oct 29, 2014

YARN

Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.

Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.

Data Ingestion and Integration

Apache Sqoop - Apache Sqoop

Apache Kafka - Apache Kafka

Presentations

Docker based Hadoop provisioning

Jul 14, 2014

Hadoop

Genie (⭐1.6k) - Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.

NoSQL

Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

OpenTSDB - The Scalable Time Series Database

Apache Cassandra

Data Ingestion and Integration

Suro (⭐772) - Netflix's distributed Data Pipeline

DSL

PigPen (⭐542) - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Libraries and Tools

Hue - A Web interface for analyzing data with Apache Hadoop.

Apache Thrift

Apache Avro - Apache Avro is a data serialization system.

Spring for Apache Hadoop

Distributed Computing and Programming

Apache Spark

Apache Crunch

Cascading - Cascading is the proven application development platform for building data applications on Hadoop.

Packaging, Provisioning and Monitoring

Apache Ambari - Apache Ambari

ankush (⭐21) - A big data cluster management tool that creates and manages clusters of different technologies.

Apache Zookeeper - Apache Zookeeper

Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework

Machine learning and Big Data analytics

MLlib - MLlib is Apache Spark's scalable machine learning library.

R - R is a free software environment for statistical computing and graphics.

Websites

The Hadoop Ecosystem Table

Hadoop illuminated - Open Source Hadoop Book

Presentations

Apache Hadoop In Theory And Practice

Hadoop Operations at LinkedIn

Hadoop Performance at LinkedIn

Books

Hadoop: The Definitive Guide

Hadoop Operations

Apache Hadoop Yarn

HBase: The Definitive Guide

Programming Pig

Programming Hive

Jul 09, 2014

Hadoop

hadoopy (⭐243) - Python MapReduce library written in Cython.

mrjob (⭐2.6k) - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.

pydoop - Pydoop is a package that provides a Python API for Hadoop.

hdfs-du (⭐228) - HDFS-DU is an interactive visualization of the Hadoop distributed file system.

White Elephant (⭐191) - Hadoop log aggregator and dashboard

Packaging, Provisioning and Monitoring

Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem

Websites

Hadoop Weekly

Jul 08, 2014

Hadoop

Apache Hadoop - Apache Hadoop

SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data.

GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework

NoSQL

Apache HBase - Apache HBase

happybase (⭐595) - A developer-friendly Python library to interact with Apache HBase.

Hannibal (⭐170) - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.

Haeinsa (⭐158) - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase

hindex (⭐588) - Secondary Index for HBase

Workflow, Lifecycle and Governance

Azkaban

Data Ingestion and Integration

Apache Flume - Apache Flume

DSL

Apache Pig - Apache Pig

Apache DataFu - A collection of libraries for working with large-scale data in Hadoop

packetpig (⭐301) - Open Source Big Data Security Analytics

akela (⭐76) - Mozilla's utility library for Hadoop, HBase, Pig, etc.

seqpig - Simple and scalable scripting for large sequencing data sets (e.g., bioinformatics) in Hadoop

Lipstick (⭐460) - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig

Libraries and Tools

Kite Software Development Kit - A set of libraries, tools, examples, and documentation

gohadoop (⭐307) - Native Go clients for Apache Hadoop YARN.

Packaging, Provisioning and Monitoring

Ganglia Monitoring System

Benchmark

Big Data Benchmark

HiBench (⭐1.3k)