Track Awesome Bigdata Updates Daily

A curated list of awesome big data frameworks, resources and other awesomeness.

newTendermint/awesome-bigdata · ⭐ 13K · 🏷️ Big Data

Feb 14, 2025

SQL-like processing

Iceberg - an open table format for huge analytic datasets. Iceberg adds tables to Trino and Spark that use a high-performance format that works just like a SQL table.

Applications

Comet - Comet provides an end-to-end model evaluation platform for AI developers, with best-in-class LLM evaluations, experiment tracking, and production monitoring.

Opik - Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

Business Intelligence

Lightdash (⭐4.4k) - The open source Looker alternative built on dbt

May 31, 2023

Data Ingestion

Zilla (⭐568) - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.

Benchmarking

UCSB (⭐53) - extended Yahoo Cloud Serving Benchmark for NoSQL databases.

Applications

Substation (⭐339) - Substation is a cloud native data pipeline and transformation toolkit written in Go.

Oct 01, 2021

Time-Series Databases

InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.

QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.

Mar 24, 2021

Internet of things and sensor data

Ably - Pub/sub messaging platform for IoT

Mar 11, 2021

Data Ingestion

Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.

Data Visualization

Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.

Mar 01, 2021

Frameworks

Smooks (⭐398) - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.

Feb 11, 2021

Scheduling

Cronicle (⭐4.2k) - Distributed, easy to install, NodeJS based, task scheduler

Data Visualization

Dash (⭐22k) - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required

Feb 06, 2021

Data Visualization / Graph Based approach

Google Bigtable (⭐50).

Feb 02, 2021

Business Intelligence

Count - notebook-based analytics and visualisation platform using SQL or drag-and-drop.

Jan 01, 2021

Machine Learning

Shapley (⭐218) - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.

Dec 17, 2020

Applications

HASH - open source simulation and visualization platform.

Nov 17, 2020

Machine Learning

PyTorch Geometric Temporal (⭐2.7k) - a temporal extension library for PyTorch Geometric.

Nov 05, 2020

Books / Streaming

Azure Data Engineering - A book about data engineering in general and the Azure platform specifically

Oct 02, 2020

Key-value Data Model

Graviton (⭐420) - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).

Sep 17, 2020

Videos

Elasticsearch 7 and Elastic Stack - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.

Sep 16, 2020

Scheduling

Dagster (⭐13k) - a data orchestrator for machine learning, analytics, and ETL.

Aug 24, 2020

Videos

Data warehouse schema design - dimensional modeling and star schema - Introduction to schema design for data warehouse using the star schema method.

Aug 19, 2020

SQL-like processing

Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.

Aug 07, 2020

SQL-like processing

Materialize (⭐5.9k) - is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.

Jul 18, 2020

Key-value Data Model

GhostDB (⭐754) - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

Data Ingestion

Apache Pulsar (⭐14k) - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.

Jul 10, 2020

Search engine and framework

Weaviate (⭐12k) - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.

Jun 12, 2020

Books / Streaming

Grokking Streaming Systems - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Written to be tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose.

May 21, 2020

Data Ingestion

redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.

May 18, 2020

Machine Learning

Little Ball of Fur (⭐707) - A subsampling library for graph-structured data, written in Python.

May 07, 2020

Data Ingestion

RudderStack (⭐4.1k) - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.

Apr 29, 2020

Data Ingestion

Gazette (⭐739) - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.

Mar 08, 2020

Frameworks

Bistro (⭐1k) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.

NewSQL Databases

BayesDB (⭐892) - statistic oriented SQL database.

Machine Learning

Oryx (⭐1.8k) - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.

Lambdo (⭐1) - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.

2001 - 2010

2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.

2008 - AMPLab - Chukwa: A large-scale monitoring system.

Jan 25, 2020

Machine Learning

Karate Club (⭐2.2k) - An unsupervised machine learning library for graph-structured data, written in Python.

Jan 13, 2020

System Deployment

Linkis (⭐3.3k) - Linkis helps easily connect to various back-end computation/storage engines.

Data Visualization

DataSphere Studio (⭐3.1k) - one-stop data application development management portal.

Jan 10, 2020

Distributed Programming

Apache Spark Streaming - framework for stream processing, part of Spark.

Dec 26, 2019

NewSQL Databases

yugabyteDB (⭐9.2k) - open source, high-performance, distributed SQL database compatible with PostgreSQL.

Dec 13, 2019

Data Visualization / Graph Based approach

Monte Carlo Tree Search Papers awesome-monte-carlo-tree-search-papers (⭐663).

Dec 04, 2019

Time-Series Databases

TDengine (⭐24k) - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data

Business Intelligence

Saiku Analytics - Open source analytics platform.

Oct 08, 2019

NewSQL Databases

KarelDB (⭐390) - a relational database backed by Apache Kafka.

Oct 06, 2019

Business Intelligence

Knowage - open source business intelligence platform (formerly SpagoBI).

Oct 02, 2019

Search engine and framework

Facebook Faiss (⭐33k) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.

Annoy (⭐13k) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Sep 17, 2019

Videos

Machine Learning, Data Science and Deep Learning with Python - LiveVideo tutorial that covers machine learning, TensorFlow, artificial intelligence, and neural networks.

Sep 14, 2019

Machine Learning

ML Workspace (⭐3.5k) - All-in-one web-based IDE specialized for machine learning and data science.

Sep 09, 2019

Business Intelligence

intermix.io - Performance Monitoring for Amazon Redshift

Aug 30, 2019

SQL-like processing

Apache HCatalog - table and storage management layer for Hadoop.

Jul 17, 2019

Applications

Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.

Jul 14, 2019

Data Visualization / Graph Based approach

Graph Classification awesome-graph-classification (⭐4.8k).

Jul 08, 2019

SQL-like processing

Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.

Jun 19, 2019

Time-Series Databases

IronDB - scalable, general-purpose time series database.

Applications

Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.

Business Intelligence

Blazer (⭐4.6k) - business intelligence made simple.

Books / Streaming

Spark in Action & Spark in Action 2nd Ed. - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.

Data Visualization / Graph Based approach

Kafka awesome-kafka (⭐210).

May 28, 2019

Data Visualization / Graph Based approach

Decision Tree Papers awesome-decision-tree-papers (⭐2.4k).

Fraud Detection Papers awesome-fraud-detection-papers (⭐1.7k).

Gradient Boosting Papers awesome-gradient-boosting-papers (⭐1k).

May 26, 2019

Time-Series Databases

VictoriaMetrics (⭐13k) - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included

May 24, 2019

Distributed Programming

Ray (⭐35k) - A fast and simple framework for building and running distributed applications.

Feb 01, 2019

Data Visualization

Vega (⭐11k) - a visualization grammar.

Jan 31, 2019

Graph Data Model

JanusGraph - open-source, distributed graph database with multiple options for storage backends (Bigtable, HBase, Cassandra, etc.) and indexing backends (Elasticsearch, Solr, Lucene).

Jan 26, 2019

Data Visualization / Graph Based approach

Network Embedding awesome-network-embedding (⭐2.6k).

Community Detection awesome-community-detection (⭐2.3k).

Jan 25, 2019

Frameworks

Polyaxon (⭐3.6k) - A platform for reproducible and scalable machine learning and deep learning.

Jan 07, 2019

Machine Learning

Feast (⭐5.8k) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.

Nov 16, 2018

Business Intelligence

Numeracy - Fast, clean SQL client and business intelligence.

Oct 31, 2018

Interesting Readings

Monitoring Cassandra performance - Guide to monitoring Cassandra, including native methods for metrics collection.

Books / Streaming

Fusion in Action - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.

Oct 29, 2018

Service Programming

Mara (⭐2.1k) - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

Oct 27, 2018

Books / Streaming

Data Science at Scale with Python and Dask - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.

Oct 21, 2018

Books / Graph Based approach

Graph-Powered Machine Learning - Alessandro Negro. Combine graph theory and models to improve machine learning projects

Oct 06, 2018

Time-Series Databases

M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.

Oct 02, 2018

Data Visualization

Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.

Oct 01, 2018

Graph Data Model

Microsoft Graph Engine (⭐2.2k) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.

Data Ingestion

Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service

Aug 25, 2018

NewSQL Databases

ActorDB (⭐1.9k) - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.

Map-D - GPU in-memory database, big data analysis and visualization platform.

VoltDB - claims to be fastest in-memory database.

Business Intelligence

Metabase (⭐40k) - The simplest, fastest way to get business intelligence and analytics to everyone in your company.

Jul 13, 2018

Data Visualization

DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.

Jul 09, 2018

Columnar Databases

Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.

Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.

IndexR (⭐454) - an open-source columnar storage format for fast, realtime analytics on big data.

LocustDB (⭐1.6k) - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.

Jun 21, 2018

Distributed Programming

Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.

May 20, 2018

Time-Series Databases

Thanos (⭐13k) - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.

Apr 20, 2018

Distributed Index

Pilosa (⭐2.5k) - Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.

Feb 26, 2018

Data Visualization / Graph Based approach

Public Datasets awesome-public-datasets (⭐62k).

Feb 19, 2018

System Deployment

Kubernetes - a system for automating deployment, scaling, and management of containerized applications.

Jan 12, 2018

Internet of things and sensor data

NetLytics (⭐9) - Analytics platform to process network data on Spark.

Dec 27, 2017

Videos

Spark in Motion - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.

Dec 25, 2017

Key-value Data Model

Ignite - is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.

Dec 20, 2017

Data Ingestion

Apache NiFi - Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.

Dec 06, 2017

Graph Data Model

Neo4j - graph database written entirely in Java.

Nov 28, 2017

Distributed Programming

Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.

Nov 17, 2017

Business Intelligence

SparklineData SNAP - modern BI platform powered by Apache Spark.

Nov 16, 2017

Books / Streaming

Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.

Reactive Data Handling - Reactive Data Handling is a free eBook collecting five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads.

Oct 31, 2017

Search engine and framework

Vespa - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.

Oct 27, 2017

NewSQL Databases

SenseiDB - distributed, realtime, semi-structured database.

Oct 23, 2017

Time-Series Databases

SiriDB (⭐506) - Highly-scalable, robust and fast, open source time series database with cluster functionality.

Oct 14, 2017

RDBMS

Teradata - high-performance MPP data warehouse platform.

Distributed Filesystem

Apache Kudu - Hadoop's storage layer to enable fast analytics on fast data.

SQL-like processing

Aster Database - SQL-like analytic processing for MapReduce.

Oct 13, 2017

Time-Series Databases

Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.

Oct 11, 2017

Applications

AthenaX (⭐1.2k) - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).

Oct 08, 2017

Books / Streaming

Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.

Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.

Oct 02, 2017

Graph Data Model

NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.

Sep 27, 2017

Distributed Programming

Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.

Sep 24, 2017

Security

BDA (⭐104) - The vulnerability detector for Hadoop and Spark

Aug 03, 2017

Scheduling

Apache Airflow (⭐39k) - a platform to programmatically author, schedule and monitor workflows.

Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight

Jul 21, 2017

PostgreSQL forks and evolutions

PipelineDB - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables

TimescaleDB - An open-source time-series database optimized for fast ingest and complex queries

Jul 12, 2017

NewSQL Databases

Comdb2 (⭐1.4k) - a clustered RDBMS built on optimistic concurrency control techniques.

Jul 06, 2017

RDBMS

MySQL - The world's most popular open source database.

PostgreSQL - The world's most advanced open source database.

Distributed Programming

Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.

Apache S4 - framework for stream processing, implementation of S4.

Google Dataflow - create data pipelines to help ingest, transform and analyze data.

Google MapReduce - map reduce framework.

Google MillWheel - fault tolerant stream processing framework.

Onyx - Distributed computation for the cloud.

Pinterest Pinlater - asynchronous job execution system.

Twitter TSAR - TimeSeries AggregatoR by Twitter.

Distributed Filesystem

BeeGFS - formerly FhGFS, parallel distributed file system.

Google Megastore - scalable, highly available storage.

GridGain - GGFS, Hadoop compliant in-memory file system.

Red Hat GlusterFS - scale-out network-attached storage file system.

Document Data Model

Actian Versant - commercial object-oriented database management system.

LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.

Microsoft Azure DocumentDB - NoSQL cloud database service with protocol support for MongoDB

MongoDB - Document-oriented database system.

RethinkDB - document database that supports queries like table joins and group by.

Key Map Data Model

Google BigTable - column-oriented distributed datastore.

Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.

Key-value Data Model

Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.

Redis - in-memory key-value datastore.

Graph Data Model

GCHQ Gaffer (⭐1.8k) - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.

Google Cayley (⭐15k) - open-source graph database.

Twitter FlockDB (⭐3.3k) - distributed graph database.

Columnar Databases

Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.

SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.

NewSQL Databases

Google F1 - distributed SQL database built on Spanner.

Google Spanner - globally distributed semi-relational database.

SAP HANA - is an in-memory, column-oriented, relational database management system.

Time-Series Databases

Prometheus - a time series database and service monitoring system.

SQL-like processing

Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.

Cloudera Impala - framework for interactive analysis, inspired by Dremel.

Google BigQuery - framework for interactive analysis, implementation of Dremel.

Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.

Stinger - interactive query for Hive.

Data Ingestion

Amazon Kinesis - real-time processing of streaming data at massive scale.

Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.

Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.

LinkedIn Databus - stream of change capture events for a database.

Service Programming

Google Chubby - a lock service for loosely-coupled distributed systems.

LinkedIn Norbert - cluster manager.

OpenMPI - message passing framework.

Serf - decentralized solution for service discovery and orchestration.

Machine Learning

MonkeyLearn - Text mining made easy. Extract and classify data from text.

PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.

Security

Apache Eagle - real time monitoring solution

System Deployment

Apache YARN - Cluster manager.

Google Borg - job scheduling and monitoring system.

Hortonworks HOYA - application that can deploy HBase cluster on YARN.

Applications

Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.

Argus (⭐506) - Time series monitoring and alerting platform.

Hunk - Splunk analytics for Hadoop.

MADlib - data-processing library of an RDBMS to analyze data.

Splunk - analyzer for machine-generated data.

Sumo Logic - cloud based analyzer for machine-generated data.

Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.

Search engine and framework

Elassandra (⭐1.7k) - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.

Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.

Google Percolator - continuous indexing system.

LinkedIn Galene - search architecture at LinkedIn.

MySQL forks and evolutions

Amazon RDS - MySQL databases in Amazon's cloud.

Google Cloud SQL - MySQL databases in Google's cloud.

MySQL Cluster - MySQL implementation using NDB Cluster storage engine.

PostgreSQL forks and evolutions

Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.

Embedded Databases

BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.

LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.

Business Intelligence

BIME Analytics - business intelligence platform in the cloud.

datapine - self-service business intelligence tool in the cloud.

GoodData - platform for data products and embedded analytics.

Jedox Palo - customisable Business Intelligence platform.

Jethrodata - Interactive Big Data Analytics.

Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.

Qlik - business intelligence and analytics platform.

Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.

Zoomdata - Big Data Analytics.

Data Visualization

D3 - JavaScript library for manipulating documents.

FnordMetric - write SQL queries that return SVG charts rather than tables

Grafana - Graphite dashboard frontend, editor and graph composer.

Graphite - scalable Realtime Graphing.

Highcharts - simple and flexible charting API.

Metricsgraphic.js - a library built on top of D3 that is optimized for time-series data

Superset (⭐64k) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.

Zeppelin (⭐411) - a notebook-style collaborative data analysis.

Zing Charts - JavaScript charting library for big data.

Internet of things and sensor data

ThingWorx - Rapid development and connection of intelligent systems

Interesting Readings

NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.

2013 - 2014

2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.

2013 - Google - F1: A Distributed SQL Database That Scales.

2011 - 2012

2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.

2012 - Google - Spanner: Google’s Globally-Distributed Database.

2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

2001 - 2010

2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).

2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.

2006 - Google - The Chubby lock service for loosely-coupled distributed systems.

2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.

2003 - Google - The Google File System.

Books / Streaming

Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business

Jun 27, 2017

Data Ingestion

Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.

Jun 19, 2017

Key-value Data Model

BTDB (⭐139) - Key Value Database in .NET with Object DB Layer, RPC, dynamic IL and much more

Jun 03, 2017

Key-value Data Model

Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.

May 27, 2017

Graph Data Model

AgensGraph - a new generation multi-model graph database for the modern complex data environment.

May 25, 2017

Search engine and framework

MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.

Mar 31, 2017

Distributed Programming

IBM Streams - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.

streamsx.topology (⭐29) - Libraries to enable building IBM Streams application in Java, Python or Scala.

Internet of things and sensor data

Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.

Mar 22, 2017

Books / Distributed systems

Distributed Systems for fun and profit - Theory of distributed systems. Includes parts about time and ordering, replication and impossibility results.

Feb 28, 2017

Distributed Filesystem

Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud

SQL-like processing

Pivotal HDB - SQL-like data warehouse system for Hadoop.

Machine Learning

Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform

Security

Apache Ranger - Central security admin & fine-grained authorization for Hadoop

Internet of things and sensor data

Azure IoT Hub - Cloud-based bi-directional monitoring and messaging hub

Feb 23, 2017

Distributed Programming

Rackerlabs Blueflood - multi-tenant distributed metric processing system

Key Map Data Model

ScyllaDB - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.

Graph Data Model

GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.

Benchmarking

Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.

Data Visualization

Lumify - open source big data analysis and visualization platform

Feb 08, 2017

Frameworks

Pachyderm - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.

Feb 02, 2017

Service Programming

Hydrosphere Mist (⭐325) - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.

Jan 26, 2017

Key-value Data Model

Edis (⭐467) - is a protocol-compatible server replacement for Redis.

Data Ingestion

Kestrel - distributed message queue system.

Dec 23, 2016

Time-Series Databases

Beringei (⭐3.2k) - Facebook's in-memory time-series database.

Nov 23, 2016

Distributed Programming

Skale (⭐397) - High performance distributed data processing in NodeJS.

Nov 14, 2016

Applications

Rakam (⭐798) - open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.

Oct 25, 2016

Key Map Data Model

Baidu Tera (⭐1.9k) - an Internet-scale database, inspired by BigTable.

Key-value Data Model

SummitDB (⭐1.4k) - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.

Tile38 (⭐9.2k) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON

Graph Data Model

EliasDB (⭐1k) - a lightweight graph based database that does not require any third-party libraries.

NewSQL Databases

Bedrock - a simple, modular, networked and distributed transaction layer built atop SQLite.

Applications

411 (⭐972) - a web application for alert management resulting from scheduled searches into Elasticsearch.

Atlas (⭐3.5k) - a backend for managing dimensional time series data.

Oct 24, 2016

Distributed Filesystem

Baidu File System (⭐2.9k) - distributed filesystem.

Oct 21, 2016

Machine Learning

DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.

Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.

H2O (⭐7k) - statistical, machine learning and math runtime with Hadoop. R and Python.

Keras (⭐63k) - An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.

Mahout - An Apache-backed machine learning library for Hadoop.

ND4J - A matrix library for the JVM. Numpy for Java.

RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.

Sibyl - System for Large Scale Machine Learning at Google.

TensorFlow (⭐188k) - Library from Google for machine learning using data flow graphs.

Theano - A Python-focused machine learning library supported by the University of Montreal.

Torch - A deep learning library with a Lua API, supported by NYU and Facebook.

Velox (⭐110) - System for serving machine learning predictions.

Benchmarking

Deeplearning4j Benchmarks

Sep 29, 2016

Time-Series Databases

Druid (⭐14k) - Column oriented distributed data store ideal for powering interactive applications.

Riak-TS - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.

Akumuli (⭐836) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".

Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

Dalmatiner DB (⭐693) - Fast distributed metrics database.

Blueflood (⭐596) - A distributed system designed to ingest and process time series data.

Timely (⭐381) - Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.

Sep 23, 2016

Interesting Readings

Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.

Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.

Sep 14, 2016

Books / Streaming

Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.

Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.

Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.

Stream Data Processing: A Quality of Service Perspective - Presents a new paradigm suitable for stream and complex event processing.

Aug 30, 2016

SQL-like processing

Apache Calcite - framework that allows efficient translation of queries involving heterogeneous and federated data.

Aug 19, 2016

Distributed Filesystem

Ambry (⭐1.7k) - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.

Key-value Data Model

Bolt (⭐14k) - an embedded key-value database for Go.

BuntDB (⭐4.6k) - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.

HyperDex (⭐1.4k) - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.

Columnar Databases

ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.

EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.

Applications

ElastAlert (⭐8k) - ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.

Kapacitor (⭐2.3k) - an open source framework for processing, monitoring, and alerting on time series data.

Aug 17, 2016

Data Visualization

ReCharts - A composable charting library built on React components

Jul 02, 2016

Time-Series Databases

Chronix - a time series storage built to store time series highly compressed and for fast access times.

Jun 21, 2016

Distributed Programming

Apache Gearpump - real-time big data streaming engine based on Akka.

Jun 02, 2016

Time-Series Databases

Cube - uses MongoDB to store time series data.

Newts - a time series database based on Apache Cassandra.

TrailDB - an efficient tool for storing and querying series of events.

Data Visualization

AnyChart - fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.

May 28, 2016

Data Visualization

Bloomery (⭐17) - Web UI for Impala.

May 27, 2016

Time-Series Databases

KairosDB (⭐1.7k) - similar to OpenTSDB but allows for Cassandra.

May 26, 2016

Distributed Programming

Twitter Heron (⭐3.6k) - Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.

May 23, 2016

Machine Learning

MOA - MOA performs big data stream mining in real time, and large scale machine learning.

May 17, 2016

Distributed Programming

Apache APEX - a unified, enterprise platform for big data stream and batch processing.

Apr 20, 2016

Data Visualization

chartd - responsive, retina-compatible charts with just an img tag.

Apr 06, 2016

Key-value Data Model

TiKV (⭐16k) - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.

Mar 29, 2016

Distributed Programming

Netflix PigPen (⭐567) - map-reduce for Clojure which compiles to Apache Pig.

Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.

Mar 21, 2016

Graph Data Model

DGraph (⭐21k) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.

Mar 14, 2016

Distributed Filesystem

Alluxio - reliable file sharing at memory speed across cluster frameworks.

Mar 11, 2016

2015 - 2016

2015 - Facebook - One Trillion Edges: Graph Processing at Facebook-Scale.

Mar 06, 2016

Data Ingestion

Skizze (⭐771) - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.

Mar 01, 2016

Data Visualization

Shiny - a web application framework for R.

Feb 28, 2016

Key-value Data Model

GridDB (⭐2.4k) - suitable for sensor data stored in a timeseries.

Feb 23, 2016

Applications

SnappyData (⭐1k) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.

Jan 20, 2016

Distributed Programming

Tuktu (⭐60) - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!

Document Data Model

RavenDB - A transactional, open-source Document Database.

Key Map Data Model

Hypertable - column-oriented distributed datastore, inspired by BigTable.

Applications

Countly - open source mobile and web analytics platform, based on Node.js & MongoDB.

Kylin - open source Distributed Analytics Engine from eBay.

Data Visualization

Redash (⭐27k) - open-source platform to query and visualize data.

Dec 14, 2015

Data Visualization

D3.compose (⭐697) - Compose complex, data-driven visualizations from reusable charts and components.

Dec 08, 2015

Machine Learning

BidMach (⭐915) - CPU and GPU-accelerated Machine Learning Library.

Dec 02, 2015

RDBMS

Oracle Database - object-relational database management system.

Nov 24, 2015

Distributed Programming

Apache REEF - retainable evaluator execution framework to simplify and unify the lower layers of big data systems.

Nov 23, 2015

Time-Series Databases

Heroic - a scalable time series database based on Cassandra and Elasticsearch.

Nov 18, 2015

Data Visualization

Plotly.js (⭐17k) - The open source JavaScript graphing library that powers plotly.

Nov 14, 2015

Distributed Filesystem

Lustre file system - high-performance distributed filesystem.

Machine Learning

SAMOA - distributed streaming machine learning framework.

Data Visualization

Bokeh - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.

2001 - 2010

2010 - Yahoo - S4: Distributed Stream Computing Platform.

Nov 13, 2015

Distributed Programming

Apache Flink - high-performance runtime, and automatic program optimization.

Apache Spark - framework for in-memory cluster computing.

Apache Storm - framework for stream processing by Twitter, also available on YARN.

Apache Samza - stream processing framework, based on Kafka and YARN.

Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.

Pydoop - Python MapReduce and HDFS API for Hadoop.

Distributed Filesystem

Quantcast File System QFS - open-source distributed file system.

Key Map Data Model

Google Cloud Datastore - a fully managed, schemaless database for storing non-relational data over BigTable.

Tephra (⭐157) - Transactions for HBase.

Key-value Data Model

EventStore - distributed time series database.

Graph Data Model

Apache Spark Bagel - implementation of Pregel, part of Spark.

ArangoDB - multi model distributed database.

MapGraph - Massively Parallel Graph processing on GPUs.

OrientDB - document and graph database.

Columnar Databases

Parquet - columnar storage format for Hadoop.

Vertica - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

NewSQL Databases

CitusDB - scales out PostgreSQL through sharding and replication.

HandlerSocket - NoSQL plugin for MySQL/MariaDB.

SQL-like processing

Apache Drill - framework for interactive analysis, inspired by Dremel.

Apache Phoenix - SQL skin over HBase.

Concurrent Lingual - SQL-like query language for Cascading.

Facebook PrestoDB - distributed SQL query engine.

SparkSQL - Manipulating Structured Data Using Spark.

Tajo - distributed data warehouse system on Hadoop.

Data Ingestion

Apache Chukwa - data collection system.

Facebook Scribe (⭐3.9k) - streamed log data aggregator.

Fluentd - tool to collect events and logs.

Logstash - a tool for managing events and logs.

Service Programming

Twitter Elephant Bird (⭐1.1k) - libraries for working with LZOP-compressed data.

Scheduling

Apache Aurora - is a service scheduler that runs on top of Apache Mesos.

Apache Falcon - data management framework.

Schedoscope (⭐96) - Scala DSL for agile scheduling of Hadoop jobs.

Machine Learning

Concurrent Pattern - machine learning library for Cascading.

ENCOG - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.

GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.

Applications

Imhotep - Large scale analytics platform by Indeed.

PivotalR (⭐126) - R on Pivotal HD / HAWQ and PostgreSQL.

Qubole - auto-scaling Hadoop cluster, built-in data connectors.

Search engine and framework

ElasticSearch - Search and analytics engine based on Apache Lucene.

Google Caffeine - continuous indexing system.

MySQL forks and evolutions

Percona Server - enhanced, drop-in replacement for MySQL.

TokuDB - TokuDB is a storage engine for MySQL and MariaDB.

Embedded Databases

LevelDB (⭐37k) - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

Business Intelligence

Tableau - business intelligence platform.

Data Visualization

Kibana - visualize logs and time-stamped data

Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.

Internet of things and sensor data

TempoIQ - Cloud-based sensor analytics.

Pubnub - Data stream network

Oct 17, 2015

Search engine and framework

Sphinx Search Server - fulltext search engine.

Oct 04, 2015

Data Ingestion

StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.

Sep 21, 2015

Key Map Data Model

Apache Accumulo - distributed key/value store, built on Hadoop.

Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.

Apache HBase - column-oriented distributed datastore, inspired by BigTable.

Sep 08, 2015

NewSQL Databases

TiDB (⭐38k) - TiDB is a distributed SQL database. Inspired by the design of Google F1.

Sep 06, 2015

Applications

Hermes (⭐819) - asynchronous message broker built on top of Kafka.

Aug 10, 2015

Distributed Filesystem

Seaweed-FS (⭐24k) - simple and highly scalable distributed file system.

Jul 03, 2015

Key Map Data Model

InfiniDB (⭐250) - accessed through a MySQL interface and uses massively parallel processing to parallelize queries.

NewSQL Databases

Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.

Scheduling

Chronos - distributed and fault-tolerant scheduler.

Linkedin Azkaban - batch workflow job scheduler.

System Deployment

Apache Slider (⭐78) - is a YARN application to deploy existing distributed applications on YARN.

Applications

Domino - Run, scale, share, and deploy models — without any infrastructure.

Data Visualization

CartoDB (⭐2.8k) - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.

Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.

Gephi (⭐6k) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.

Apr 20, 2015

Distributed Filesystem

Tahoe-LAFS - decentralized cloud storage system.

Apr 14, 2015

Data Ingestion

Linkedin Gobblin (⭐2.2k) - LinkedIn's universal data ingestion framework.

Mar 30, 2015

Frameworks

Tigon (⭐284) - High Throughput Real-time Stream Processing Framework.

Mar 22, 2015

Distributed Filesystem

Google GFS - distributed filesystem.

Mar 07, 2015

Data Visualization

Airpal (⭐2.8k) - Web UI for PrestoDB.

Feb 28, 2015

Distributed Programming

Metamarkets Druid - framework for real-time analysis of large datasets.

Jan 18, 2015

Data Visualization

D3Plus - A fairly robust set of reusable charts and styles for d3.js.

Jan 08, 2015

Data Visualization

Echarts (⭐62k) - Baidu's enterprise charts.

Dec 29, 2014

Internet of things and sensor data

IFTTT - If this then that

Evrything - Making products smart

Nov 19, 2014

Data Visualization

Banana (⭐669) - visualize logs and time-stamped data stored in Solr. Port of Kibana.

Oct 22, 2014

Data Visualization / Graph Based approach

The beauty of data visualization

Designing Data Visualizations with Noah Iliinsky

Hans Rosling's 200 Countries, 200 Years, 4 Minutes

Ice Bucket Challenge Data Visualization

Oct 15, 2014

Internet of things and sensor data

2lemetry - Platform for Internet of things.

Oct 14, 2014

Data Visualization

C3 - D3-based reusable chart library

Sep 07, 2014

Data Visualization

Chartist.js (⭐78) - another open source HTML5 Charts visualization.

Aug 24, 2014

Data Visualization / Graph Based approach

Other awesome lists awesome-awesomeness (⭐32k).

Even more lists awesome (⭐345k).

Another list? list (⭐10k).

WTF! awesome-awesome-awesome (⭐2k).

Analytics awesome-analytics (⭐4k).

Aug 20, 2014

Distributed Filesystem

Disco DDFS - distributed filesystem.

Applications

Adobe spindle (⭐331) - Next-generation web analytics processing with Scala, Spark, and Parquet.

Aug 19, 2014

Key-value Data Model

Aerospike - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."

TreodeDB (⭐176) - key-value store that's replicated and sharded and provides atomic multirow writes.

Aug 17, 2014

Data Visualization

DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.

IPython - provides a rich architecture for interactive computing.

2013 - 2014

2014 - Stanford - Mining of Massive Datasets.

Aug 05, 2014

Key Map Data Model

Facebook HydraBase - evolution of HBase made by Facebook.

Columnar Databases

Columnar Storage - an explanation of what columnar storage is and when you might want it.

Actian Vector - column-oriented analytic database.

MonetDB - column store database.

Aug 03, 2014

2011 - 2012

2012 - Twitter - The Unified Logging Infrastructure for Data Analytics at Twitter.

Aug 02, 2014

Frameworks

Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).

Distributed Programming

AddThis Hydra (⭐434) - distributed data processing and storage system originally developed at AddThis.

Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.

DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.

Document Data Model

Facebook Apollo - Facebook’s Paxos-like NoSQL database.

Key-value Data Model

Oracle NoSQL Database - distributed key-value database by Oracle Corporation.

Graph Data Model

Gremlin (⭐1.9k) - graph traversal Language.

Infovore (⭐148) - RDF-centric Map/Reduce framework.

NewSQL Databases

Actian Ingres - commercially supported, open-source SQL relational database management system.

Cockroach (⭐30k) - Scalable, Geo-Replicated, Transactional Datastore.

FoundationDB - distributed database, inspired by F1.

Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.

Time-Series Databases

OpenTSDB - distributed time series database on top of HBase.

SQL-like processing

RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.

Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.

Data Ingestion

LinkedIn White Elephant (⭐191) - log aggregator and dashboard.

Service Programming

Spotify Luigi (⭐18k) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Scheduling

Apache Oozie - workflow job scheduler.

Benchmarking

PUMA Benchmarking - benchmark suite for MapReduce applications.

Security

Apache Sentry - security module for data stored in Hadoop.

System Deployment

Brooklyn - library that simplifies application deployment and management.

Applications

Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.

Search engine and framework

HBase Coprocessor - implementation of Percolator, part of HBase.

Lily HBase Indexer - quickly and easily search for any content stored in HBase.

PostgreSQL forks and evolutions

HadoopDB - hybrid of MapReduce and DBMS.

IBM Netezza - high-performance data warehouse appliances.

Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.

RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.

Stado - open source MPP database system solely targeted at data warehousing and data mart applications.

Memcached forks and evolutions

Twemproxy (⭐12k) - A fast, light-weight proxy for memcached and redis.

Embedded Databases

Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.

Data Visualization

Cytoscape - JavaScript library for visualizing complex networks.

Google Charts - simple charting API.

Peity (⭐4.2k) - Progressive SVG bar, line and pie charts.

Jul 29, 2014

Data Visualization

Freeboard (⭐6.5k) - open source real-time dashboard builder for IoT and other web mashups.

Jul 28, 2014

NewSQL Databases

Datomic - distributed database designed to enable scalable, flexible and intelligent applications.

Data Ingestion

Heka (⭐3.4k) - open source stream processing software system.

Jul 19, 2014

Business Intelligence

Chartio - lean business intelligence platform to visualize and explore your data.

Jul 12, 2014

Key-value Data Model

Riak (⭐4k) - a decentralized datastore.

Applications

Eventhub (⭐1.3k) - open source event analytics platform.

Data Visualization

Arbor (⭐2.7k) - graph visualization library using web workers and jQuery.

Cubism (⭐4.9k) - JavaScript library for time series visualization.

Envisionjs (⭐1.6k) - dynamic HTML5 visualization.

Matplotlib (⭐21k) - plotting with Python.

Recline (⭐2.2k) - simple but powerful library for building data applications in pure JavaScript and HTML.

Sigma.js (⭐11k) - JavaScript library dedicated to graph drawing.

Jul 11, 2014

Machine Learning

brain (⭐8k) - Neural networks in JavaScript.

convnetjs (⭐11k) - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.

Decider (⭐384) - Flexible and Extensible Machine Learning in Ruby.

MLPNeuralNet (⭐900) - Fast multilayer perceptron neural network library for iOS and Mac OS X.

nupic (⭐6.3k) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.

scikit-learn (⭐61k) - scikit-learn: machine learning in Python.

Applications

Snowplow (⭐6.9k) - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.

Jul 10, 2014

Key-value Data Model

Tarantool (⭐3.5k) - an efficient NoSQL database and a Lua application server.

Jul 09, 2014

SQL-like processing

Apache Hive - SQL-like data warehouse system for Hadoop.

Datasalt Splout SQL - full SQL query engine for big datasets.

Spark Catalyst (⭐41k) - a query optimization framework for Spark and Shark.

Data Ingestion

Apache Flume - service to manage large amounts of log data.

Apache Kafka - distributed publish-subscribe messaging system.

Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.

HIHO (⭐91) - framework for connecting disparate data sources with Hadoop.

LinkedIn Kamikaze (⭐22) - utility package for compressing sorted integer arrays.

Netflix Suro (⭐795) - log aggregator like Storm and Samza, based on Chukwa.

Pinterest Secor (⭐1.8k) - a service implementing Kafka log persistence.

Service Programming

Akka Toolkit - runtime for distributed, and fault tolerant event-driven applications on the JVM.

Apache Avro - data serialization system.

Apache Curator - Java libraries for Apache ZooKeeper.

Apache Karaf - OSGi runtime that runs on top of any OSGi framework.

Apache Thrift - framework to build binary protocols.

Apache Zookeeper - centralized service for process management.

Spring XD (⭐475) - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.

Twitter Finagle - asynchronous network stack for the JVM.

Scheduling

Sparrow (⭐319) - scheduling platform.

Machine Learning

etcML - text classification with machine learning.

Etsy Conjecture (⭐360) - scalable Machine Learning in Scalding.

MLbase - distributed machine learning libraries for the BDAS stack.

Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.

Vowpal Wabbit (⭐8.5k) - learning system sponsored by Microsoft and Yahoo!.

WEKA - suite of machine learning software.

Benchmarking

Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.

Berkeley SWIM Benchmark (⭐129) - real-world big data workload benchmark.

Intel HiBench (⭐1.5k) - a Hadoop benchmark suite.

Security

Apache Knox Gateway - single point of secure access for Hadoop clusters.

System Deployment

Apache Ambari - operational framework for Hadoop management.

Apache Bigtop - system deployment framework for the Hadoop ecosystem.

Apache Helix - cluster management framework.

Apache Mesos - cluster manager.

Apache Whirr - set of libraries for running cloud services.

Buildoop - Similar to Apache BigTop based on Groovy language.

Cloudera HUE - web application for interacting with Hadoop.

Facebook Prism - multi-datacenter replication system.

Google Omega - job scheduling and monitoring system.

Marathon (⭐4.1k) - Mesos framework for long-running services.

Applications

Apache Nutch - open source web crawler.

Apache Tika - content analysis toolkit.

Eclipse BIRT - Eclipse-based reporting system.

SparkR - R frontend for Spark.

Search engine and framework

Apache Lucene - Search engine library.

Apache Solr - Search platform for Apache Lucene.

LinkedIn Bobo - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.

LinkedIn Cleo (⭐564) - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.

LinkedIn Zoie (⭐369) - is a realtime search/indexing system written in Java.

MySQL forks and evolutions

Drizzle - evolution of MySQL 6.0.

MariaDB - enhanced, drop-in replacement for MySQL.

ProxySQL (⭐25) - High Performance Proxy for MySQL.

WebScaleSQL - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

Memcached forks and evolutions

Facebook McDipper - key/value cache for flash storage.

Facebook Memcached - fork of Memcache.

Twitter Fatcache (⭐1.3k) - key/value cache for flash storage.

Twitter Twemcache (⭐930) - fork of Memcache.

Embedded Databases

HanoiDB (⭐307) - Erlang LSM BTree Storage.

RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

Jaspersoft - powerful business intelligence suite.

Microsoft - business intelligence software and platform.

Pentaho - business intelligence platform.

Data Visualization

Chart.js - open source HTML5 Charts visualizations.

NVD3 - chart components for d3.js.

Interesting Readings

Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.

2013 - 2014

2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.

2013 - AMPLab - MLbase: A Distributed Machine-learning System.

2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.

2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.

2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.

2013 - Metamarkets - Druid: A Real-time Analytical Data Store.

2013 - Google - Online, Asynchronous Schema Change in F1.

2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.

2013 - Facebook - Scuba: Diving into Data at Facebook.

2013 - Facebook - Unicorn: A System for Searching the Social Graph.

2013 - Facebook - Scaling Memcache at Facebook.

2011 - 2012

2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.

2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.

2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.

2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.

2012 - Microsoft - Paxos Made Parallel.

2012 - Google - Processing a trillion cells per mouse click.

2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.

2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.

2001 - 2010

2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage.

2010 - AMPLab - Spark: Cluster Computing with Working Sets.

2010 - Google - Pregel: A System for Large-Scale Graph Processing.

2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.

2006 - Google - Bigtable: A Distributed Storage System for Structured Data.

Jul 08, 2014

Distributed Programming

AMPLab SIMR - run Spark on Hadoop MapReduce v1.

Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.

Apache Gora - framework for in-memory data model and persistence.

Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.

Apache Pig - high level language to express data analysis programs for Hadoop.

Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.

Cascalog - data processing and querying library.

Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.

Concurrent Cascading - framework for data management/analytics on Hadoop.

Damballa Parkour (⭐257) - MapReduce library for Clojure.

Datasalt Pangool (⭐57) - alternative MapReduce paradigm.

Facebook Corona - Hadoop enhancement which removes single point of failure.

Facebook Peregrine - Map Reduce framework.

Facebook Scuba - distributed in-memory datastore.

JAQL - declarative programming language for working with structured, semi-structured and unstructured data.

Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.

Nokia Disco - MapReduce framework developed by Nokia.

Stratosphere - general purpose cluster computing framework.

Twitter Scalding (⭐3.5k) - Scala library for Map Reduce jobs, built on Cascading.

Twitter Summingbird (⭐2.1k) - Streaming MapReduce with Scalding and Storm, by Twitter.

Distributed Filesystem

Apache HDFS - a way to store large files across multiple machines.

Ceph Filesystem - software storage platform.

Facebook Haystack - object storage system.

Document Data Model

Crate Data - is an open source massively scalable data store. It requires zero administration.

jumboDB - document oriented datastore over Hadoop.

MarkLogic - Schema-agnostic Enterprise NoSQL database technology.

Key-value Data Model

ElephantDB (⭐558) - Distributed database specialized in exporting data from Hadoop.

LinkedIn Krati (⭐26) - is a simple persistent data store with very low latency and high throughput.

Linkedin Voldemort - distributed key/value storage system.

Storehaus (⭐465) - library to work with asynchronous key value stores, by Twitter.

Graph Data Model

Apache Giraph - implementation of Pregel, based on Hadoop.

Facebook TAO - TAO is the distributed data store that is widely used at Facebook to store and serve the social graph.

Google Pregel - graph processing framework.

GraphX - resilient Distributed Graph System on Spark.

Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.

Phoebus (⭐384) - framework for large scale graph processing.

Titan - distributed graph database, built over Cassandra.

NewSQL Databases

Amazon RedShift - data warehouse service, based on PostgreSQL.

H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.

Haeinsa (⭐158) - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.

InfiniSQL - infinitely scalable RDBMS.

MemSQL - in-memory SQL database with optimized columnar storage on flash.

NuoDB - SQL/ACID compliant distributed database.

Sky - database used for flexible, high performance analysis of behavioral data.

SymmetricDS - open source software for both file and database synchronization.