Track Awesome Bigdata Updates Weekly
A curated list of awesome big data frameworks, resources and other awesomeness.
😺 newTendermint/awesome-bigdata · ⭐ 12K · 🏷️ Big Data
May 29 - Jun 04, 2023
Data Ingestion
- Zilla (⭐155) - An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.
Benchmarking
- UCSB (⭐31) - extended Yahoo Cloud Serving Benchmark for NoSQL databases.
Applications
- Substation (⭐173) - Substation is a cloud native data pipeline and transformation toolkit written in Go.
Sep 27 - Oct 03, 2021
Time-Series Databases
- InfluxDB - a time series database with optimised IO and queries, supports pgsql and influx wire protocols.
- QuestDB - high-performance, open-source SQL database for applications in financial services, IoT, machine learning, DevOps and observability.
Mar 22 - Mar 28, 2021
Internet of things and sensor data
- Ably - Pub/sub messaging platform for IoT
Mar 08 - Mar 14, 2021
Data Ingestion
- Census - A reverse ETL product that lets you sync data from your data warehouse to SaaS applications. No engineering favors required, just SQL.
Data Visualization
- Dekart - Large scale geospatial analytics for Google BigQuery based on Kepler.gl.
Mar 01 - Mar 07, 2021
Frameworks
- Smooks (⭐365) - An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
Feb 08 - Feb 14, 2021
Scheduling
- Cronicle (⭐2.1k) - Distributed, easy to install, NodeJS based, task scheduler
Data Visualization
- Dash (⭐19k) - Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required
Feb 01 - Feb 07, 2021
Business Intelligence
- Count - notebook-based analytics and visualisation platform using SQL or drag-and-drop.
Data Visualization / Graph Based approach
Jan 04 - Jan 10, 2021
Machine Learning
- Shapley (⭐197) - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
Dec 21 - Dec 27, 2020
Applications
- HASH - open source simulation and visualization platform.
Nov 23 - Nov 29, 2020
Machine Learning
- PyTorch Geometric Temporal (⭐2.1k) - a temporal extension library for PyTorch Geometric.
Nov 09 - Nov 15, 2020
Books / Streaming
- Azure Data Engineering - A book about data engineering in general and the Azure platform specifically
Oct 05 - Oct 11, 2020
Key-value Data Model
- Graviton (⭐411) - a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
Sep 21 - Sep 27, 2020
Scheduling
- Dagster (⭐7.5k) - a data orchestrator for machine learning, analytics, and ETL.
Videos
- Elasticsearch 7 and Elastic Stack - LiveVideo tutorial that covers searching, analyzing, and visualizing big data on a cluster with Elasticsearch, Logstash, Beats, Kibana, and more.
Aug 31 - Sep 06, 2020
Videos
- Data warehouse schema design - dimensional modeling and star schema - Introduction to schema design for data warehouse using the star schema method.
Aug 24 - Aug 30, 2020
SQL-like processing
- Invantive SQL - SQL engine for online and on-premise use with integrated local data replication and 70+ connectors.
Aug 10 - Aug 16, 2020
SQL-like processing
- Materialize (⭐5.1k) - a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.
Jul 20 - Jul 26, 2020
Key-value Data Model
- GhostDB (⭐737) - a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
Data Ingestion
- Apache Pulsar (⭐13k) - a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
Jul 13 - Jul 19, 2020
Search engine and framework
- Weaviate (⭐6.2k) - Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
Jun 15 - Jun 21, 2020
Books / Streaming
- Grokking Streaming Systems - Grokking Streaming Systems helps you unravel what streaming systems are, how they work, and whether they’re right for your business. Written to be tool-agnostic, you’ll be able to apply what you learn no matter which framework you choose.
May 25 - May 31, 2020
Data Ingestion
- redpanda - A Kafka® replacement for mission critical systems; 10x faster. Written in C++.
Machine Learning
- Little Ball of Fur (⭐643) - A subsampling library for graph structured data. Python
May 11 - May 17, 2020
Data Ingestion
- RudderStack (⭐3.6k) - an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
May 04 - May 10, 2020
Data Ingestion
- Gazette (⭐469) - Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
Mar 09 - Mar 15, 2020
Frameworks
- Bistro (⭐1k) - general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
NewSQL Databases
- BayesDB (⭐887) - statistic oriented SQL database.
Machine Learning
- Oryx (⭐1.8k) - Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.
- Lambdo (⭐1) - Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
2001 - 2010
- 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
- 2008 - AMPLab - Chukwa: A large-scale monitoring system.
Jan 27 - Feb 02, 2020
Machine Learning
- Karate Club (⭐1.9k) - An unsupervised machine learning library for graph structured data. Python
Jan 20 - Jan 26, 2020
System Deployment
- Linkis (⭐3.1k) - Linkis helps easily connect to various back-end computation/storage engines.
Data Visualization
- DataSphere Studio (⭐2.6k) - one-stop data application development management portal.
Jan 13 - Jan 19, 2020
Distributed Programming
- Apache Spark Streaming - framework for stream processing, part of Spark.
Dec 30 - Jan 05, 2019
NewSQL Databases
- yugabyteDB (⭐7.8k) - open source, high-performance, distributed SQL database compatible with PostgreSQL.
Dec 16 - Dec 22, 2019
Data Visualization / Graph Based approach
- Monte Carlo Tree Search Papers awesome-monte-carlo-tree-search-papers (⭐532).
Dec 09 - Dec 15, 2019
Time-Series Databases
- TDengine (⭐21k) - a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data
Business Intelligence
- Saiku Analytics - Open source analytics platform.
Oct 14 - Oct 20, 2019
NewSQL Databases
- KarelDB (⭐383) - a relational database backed by Apache Kafka.
Oct 07 - Oct 13, 2019
Search engine and framework
- Facebook Faiss (⭐22k) - is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
- Annoy (⭐11k) - is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
Business Intelligence
Sep 23 - Sep 29, 2019
Videos
- Machine Learning, Data Science and Deep Learning with Python - LiveVideo tutorial that covers machine learning, Tensorflow, artificial intelligence, and neural networks.
Sep 16 - Sep 22, 2019
Machine Learning
- ML Workspace (⭐3k) - All-in-one web-based IDE specialized for machine learning and data science.
Business Intelligence
- intermix.io - Performance Monitoring for Amazon Redshift
Sep 02 - Sep 08, 2019
SQL-like processing
- Apache HCatalog - table and storage management layer for Hadoop.
Jul 22 - Jul 28, 2019
Applications
- Indicative - Web & mobile analytics tool, with data warehouse (AWS, BigQuery) integration.
Jul 15 - Jul 21, 2019
SQL-like processing
- Dremio - an open-source, SQL-like Data-as-a-Service Platform based on Apache Arrow.
Data Visualization / Graph Based approach
- Graph Classification awesome-graph-classification (⭐4.6k).
Jun 24 - Jun 30, 2019
Time-Series Databases
- IronDB - scalable, general-purpose time series database.
Applications
- Jupyter - Notebook and project application for interactive data science and scientific computing across all programming languages.
Business Intelligence
- Blazer (⭐3.6k) - business intelligence made simple.
Books / Streaming
- Spark in Action & Spark in Action 2nd Ed. - Spark in Action teaches you the theory and skills you need to effectively handle batch and streaming data using Spark. Fully updated for Spark 2.0.
Data Visualization / Graph Based approach
- Kafka awesome-kafka (⭐168).
Jun 03 - Jun 09, 2019
Data Visualization / Graph Based approach
- Decision Tree Papers awesome-decision-tree-papers (⭐2.2k).
- Fraud Detection Papers awesome-fraud-detection-papers (⭐1.4k).
- Gradient Boosting Papers awesome-gradient-boosting-papers (⭐923).
May 27 - Jun 02, 2019
Distributed Programming
- Ray (⭐26k) - A fast and simple framework for building and running distributed applications.
Time-Series Databases
- VictoriaMetrics (⭐8.6k) - fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
Feb 04 - Feb 10, 2019
Graph Data Model
- JanusGraph - open-source, distributed graph database with multiple options for storage backends (Bigtable, HBase, Cassandra, etc.) and indexing backends (Elasticsearch, Solr, Lucene).
Data Visualization
- Vega (⭐10k) - a visualization grammar.
Jan 28 - Feb 03, 2019
Frameworks
- Polyaxon (⭐3.3k) - A platform for reproducible and scalable machine learning and deep learning.
Data Visualization / Graph Based approach
- Network Embedding awesome-network-embedding (⭐2.5k).
- Community Detection awesome-community-detection (⭐2.1k).
Jan 14 - Jan 20, 2019
Machine Learning
- Feast (⭐4.3k) - A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
Nov 12 - Nov 18, 2018
Business Intelligence
- Numeracy - Fast, clean SQL client and business intelligence.
Oct 29 - Nov 04, 2018
Service Programming
- Mara (⭐2k) - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Interesting Readings
- Monitoring Cassandra performance - Guide to monitoring Cassandra, including native methods for metrics collection.
Books / Streaming
- Fusion in Action - Fusion in Action teaches you to build a full-featured data analytics pipeline, including document and data search and distributed data clustering.
Oct 22 - Oct 28, 2018
Books / Streaming
- Data Science at Scale with Python and Dask - Data Science at Scale with Python and Dask teaches you how to build distributed data projects that can handle huge amounts of data.
Oct 15 - Oct 21, 2018
Books / Graph Based approach
- Graph-Powered Machine Learning - Alessandro Negro. Combine graph theory and models to improve machine learning projects
Oct 01 - Oct 07, 2018
Graph Data Model
- Microsoft Graph Engine (⭐2.1k) - a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
Time-Series Databases
- M3DB - a distributed time series database that can be used for storing realtime metrics at long retention.
Data Ingestion
- Amazon Web Services Glue - serverless fully managed extract, transform, and load (ETL) service
Data Visualization
- Frappe Charts - GitHub-inspired simple and modern SVG charts for the web with zero dependencies.
Aug 20 - Aug 26, 2018
NewSQL Databases
- ActorDB (⭐1.9k) - a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
- Map-D - GPU in-memory database, big data analysis and visualization platform.
- VoltDB - claims to be fastest in-memory database.
Business Intelligence
- Metabase (⭐32k) - The simplest, fastest way to get business intelligence and analytics to everyone in your company.
Jul 09 - Jul 15, 2018
Columnar Databases
- Google BigQuery - Google's cloud offering backed by their pioneering work on Dremel.
- Amazon Redshift - Amazon's cloud offering, also based on a columnar datastore backend.
- IndexR (⭐449) - an open-source columnar storage format for fast & realtime analytic with big data.
- LocustDB (⭐1.5k) - an experimental analytics database aiming to set a new standard for query performance on commodity hardware.
Data Visualization
- DevExtreme React Chart - High-performance plugin-based React chart for Bootstrap and Material Design.
Jun 18 - Jun 24, 2018
Distributed Programming
- Apache Beam - a unified model and set of language-specific SDKs for defining and executing data processing workflows.
May 14 - May 20, 2018
Time-Series Databases
- Thanos (⭐12k) - Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
Apr 16 - Apr 22, 2018
Distributed Index
- Pilosa (⭐2.5k) - Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.
Feb 26 - Mar 04, 2018
Data Visualization / Graph Based approach
- Public Datasets awesome-public-datasets (⭐54k).
Feb 19 - Feb 25, 2018
System Deployment
- Kubernetes - a system for automating deployment, scaling, and management of containerized applications.
Jan 08 - Jan 14, 2018
Internet of things and sensor data
- NetLytics (⭐10) - Analytics platform to process network data on Spark.
Dec 25 - Dec 31, 2017
Key-value Data Model
- Ignite - is an in-memory key-value data store providing full SQL-compliant data access that can optionally be backed by disk storage.
Videos
- Spark in Motion - Spark in Motion teaches you how to use Spark for batch and streaming data analytics.
Dec 18 - Dec 24, 2017
Data Ingestion
- Apache NiFi - Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems.
Dec 04 - Dec 10, 2017
Graph Data Model
- Neo4j - graph database written entirely in Java.
Nov 27 - Dec 03, 2017
Distributed Programming
- Baidu Bigflow - an interface that allows for writing distributed computing programs providing lots of simple, flexible, powerful APIs to easily handle data of any scale.
Nov 13 - Nov 19, 2017
Business Intelligence
- SparklineData SNAP - modern BI platform powered by Apache Spark.
Books / Streaming
- Kafka in Action - Kafka in Action is a fast-paced introduction to every aspect of working with Kafka you need to really reap its benefits.
- Reactive Data Handling - Reactive Data Handling is a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads--free eBook!
Oct 30 - Nov 05, 2017
Search engine and framework
- Vespa - is an engine for low-latency computation over large data sets. It stores and indexes your data such that queries, selection and processing over the data can be performed at serving time.
Oct 23 - Oct 29, 2017
NewSQL Databases
- SenseiDB - distributed, realtime, semi-structured database.
Time-Series Databases
- SiriDB (⭐480) - Highly-scalable, robust and fast, open source time series database with cluster functionality.
Oct 09 - Oct 15, 2017
RDBMS
- Teradata - high-performance MPP data warehouse platform.
Distributed Filesystem
- Apache Kudu - Hadoop's storage layer to enable fast analytics on fast data.
Time-Series Databases
- Axibase Time Series Database - Integrated time series database on top of HBase with built-in visualization, rule-engine and SQL support.
SQL-like processing
- Aster Database - SQL-like analytic processing for MapReduce.
Applications
- AthenaX (⭐1.2k) - a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
Oct 02 - Oct 08, 2017
Graph Data Model
- NodeXL - A free, open-source template for Microsoft® Excel® 2007, 2010, 2013 and 2016 that makes it easy to explore network graphs.
Books / Streaming
- Kafka Streams in Action - Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.
- Big Data - Big Data teaches you to build big data systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data.
Sep 25 - Oct 01, 2017
Distributed Programming
- Wallaroo - The ultrafast and elastic data processing engine. Big or fast data - no fuss, no Java needed.
Sep 18 - Sep 24, 2017
Security
- BDA (⭐103) - The vulnerability detector for Hadoop and Spark
Jul 31 - Aug 06, 2017
Scheduling
- Apache Airflow (⭐30k) - a platform to programmatically author, schedule and monitor workflows.
- Azure Data Factory - cloud-based pipeline orchestration for on-prem, cloud and HDInsight
Jul 17 - Jul 23, 2017
PostgreSQL forks and evolutions
- PipelineDB - The Streaming SQL Database. An open-source relational database that runs SQL queries continuously on streams, incrementally storing results in tables
- TimescaleDB - An open-source time-series database optimized for fast ingest and complex queries
Jul 10 - Jul 16, 2017
NewSQL Databases
- Comdb2 (⭐1.2k) - a clustered RDBMS built on optimistic concurrency control techniques.
Jul 03 - Jul 09, 2017
RDBMS
- MySQL - The world's most popular open source database.
- PostgreSQL - The world's most advanced open source database.
Distributed Programming
- Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
- Apache S4 - framework for stream processing, implementation of S4.
- Google Dataflow - create data pipelines to ingest, transform and analyze data.
- Google MapReduce - map reduce framework.
- Google MillWheel - fault tolerant stream processing framework.
- Onyx - Distributed computation for the cloud.
- Pinterest Pinlater - asynchronous job execution system.
- Twitter TSAR - TimeSeries AggregatoR by Twitter.
Distributed Filesystem
- BeeGFS - formerly FhGFS, parallel distributed file system.
- Google Megastore - scalable, highly available storage.
- GridGain - GGFS, Hadoop compliant in-memory file system.
- Red Hat GlusterFS - scale-out network-attached storage file system.
Document Data Model
- Actian Versant - commercial object-oriented database management system.
- LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
- Microsoft Azure DocumentDB - NoSQL cloud database service with protocol support for MongoDB
- MongoDB - Document-oriented database system.
- RethinkDB - document database that supports queries like table joins and group by.
Key Map Data Model
- Google BigTable - column-oriented distributed datastore.
- Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
Key-value Data Model
- Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
- Redis - in memory key value datastore.
Graph Data Model
- GCHQ Gaffer (⭐1.7k) - Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
- Google Cayley (⭐15k) - open-source graph database.
- Twitter FlockDB (⭐3.3k) - distributed graph database.
Columnar Databases
- Pivotal Greenplum - purpose-built, dedicated analytic data warehouse that offers a columnar engine as well as a traditional row-based one.
- SQream DB - A GPU powered big data database, designed for analytics and data warehousing, with ANSI-92 compliant SQL, suitable for data sets from 10TB to 1PB.
NewSQL Databases
- Google F1 - distributed SQL database built on Spanner.
- Google Spanner - globally distributed semi-relational database.
- SAP HANA - is an in-memory, column-oriented, relational database management system.
Time-Series Databases
- Prometheus - a time series database and service monitoring system.
SQL-like processing
- Actian SQL for Hadoop - high performance interactive SQL access to all Hadoop data.
- Cloudera Impala - framework for interactive analysis, Inspired by Dremel.
- Google BigQuery - framework for interactive analysis, implementation of Dremel.
- Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
- Stinger - interactive query for Hive.
Data Ingestion
- Amazon Kinesis - real-time processing of streaming data at massive scale.
- Embulk - open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services.
- Google Photon - geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
- LinkedIn Databus - stream of change capture events for a database.
Service Programming
- Google Chubby - a lock service for loosely-coupled distributed systems.
- LinkedIn Norbert - cluster manager.
- OpenMPI - message passing framework.
- Serf - decentralized solution for service discovery and orchestration.
Machine Learning
- MonkeyLearn - Text mining made easy. Extract and classify data from text.
- PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
Security
- Apache Eagle - real time monitoring solution
System Deployment
- Apache YARN - Cluster manager.
- Google Borg - job scheduling and monitoring system.
- Hortonworks HOYA - application that can deploy HBase cluster on YARN.
Applications
- Apache Metron - a platform that integrates a variety of open source big data technologies in order to offer a centralized tool for security monitoring and analysis.
- Argus (⭐497) - Time series monitoring and alerting platform.
- Hunk - Splunk analytics for Hadoop.
- MADlib - data-processing library of an RDBMS to analyze data.
- Splunk - analyzer for machine-generated data.
- Sumo Logic - cloud based analyzer for machine-generated data.
- Talend - unified open source environment for YARN, Hadoop, HBASE, Hive, HCatalog & Pig.
Search engine and framework
- Elassandra (⭐1.7k) - is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
- Enigma.io - Freemium robust web application for exploring, filtering, analyzing, searching and exporting massive datasets scraped from across the Web.
- Google Percolator - continuous indexing system.
- LinkedIn Galene - search architecture at LinkedIn.
MySQL forks and evolutions
- Amazon RDS - MySQL databases in Amazon's cloud.
- Google Cloud SQL - MySQL databases in Google's cloud.
- MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
PostgreSQL forks and evolutions
- Yahoo Everest - multi-petabyte database / MPP derived from PostgreSQL.
Embedded Databases
- BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
- LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
Business Intelligence
- BIME Analytics - business intelligence platform in the cloud.
- datapine - self-service business intelligence tool in the cloud.
- GoodData - platform for data products and embedded analytics.
- Jedox Palo - customisable Business Intelligence platform.
- Jethrodata - Interactive Big Data Analytics.
- Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
- Qlik - business intelligence and analytics platform.
- Redash - Open source business intelligence platform, supporting multiple data sources and planned queries.
- Zoomdata - Big Data Analytics.
Data Visualization
- D3 - JavaScript library for manipulating documents.
- FnordMetric - write SQL queries that return SVG charts rather than tables
- Grafana - graphite dashboard frontend, editor and graph composer.
- Graphite - scalable Realtime Graphing.
- Highcharts - simple and flexible charting API.
- Metricsgraphic.js - a library built on top of D3 that is optimized for time-series data
- Superset (⭐52k) - a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
- Zeppelin (⭐413) - a notebook-style collaborative data analysis.
- Zing Charts - JavaScript charting library for big data.
Internet of things and sensor data
- ThingWorx - Rapid development and connection of intelligent systems
Interesting Readings
- NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.
2013 - 2014
- 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
- 2013 - Google - F1: A Distributed SQL Database That Scales.
2011 - 2012
- 2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
- 2012 - Google - Spanner: Google’s Globally-Distributed Database.
- 2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.
2001 - 2010
- 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications, the basis of Percolator and Caffeine.
- 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
- 2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
- 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
- 2003 - Google - The Google File System.
Books / Streaming
- Unified Log Processing - Unified Log Processing is a practical guide to implementing a unified log of event streams (Kafka or Kinesis) in your business
Jun 26 - Jul 02, 2017
Data Ingestion
- Alooma - data pipeline as a service enabling moving data sources such as MySQL into data warehouses.
Jun 19 - Jun 25, 2017
Key-value Data Model
- BTDB (⭐126) - Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more
May 29 - Jun 04, 2017
Key-value Data Model
- Badger - a fast, simple, efficient, and persistent key-value store written natively in Go.
May 22 - May 28, 2017
Graph Data Model
- AgensGraph - a new generation multi-model graph database for the modern complex data environment.
Search engine and framework
- MG4J - MG4J (Managing Gigabytes for Java) is a full-text search engine for large document collections written in Java. It is highly customisable, high-performance and provides state-of-the-art features and new research algorithms.
Mar 27 - Apr 02, 2017
Distributed Programming
- IBM Streams - platform for distributed processing and real-time analytics. Provides toolkits for advanced analytics like geospatial, time series, etc. out of the box.
- streamsx.topology (⭐27) - Libraries to enable building IBM Streams application in Java, Python or Scala.
Internet of things and sensor data
- Apache Edgent (Incubating) - a programming model and micro-kernel style runtime that can be embedded in gateways and small footprint edge devices enabling local, real-time, analytics on the edge devices.
Mar 20 - Mar 26, 2017
Books / Distributed systems
- Distributed Systems for fun and profit - Theory of distributed systems. Includes parts about time and ordering, replication and impossibility results.
Feb 27 - Mar 05, 2017
Distributed Filesystem
- Microsoft Azure Data Lake Store - HDFS-compatible storage in Azure cloud
SQL-like processing
- Pivotal HDB - SQL-like data warehouse system for Hadoop.
Machine Learning
- Azure ML Studio - Cloud-based AzureML, R, Python Machine Learning platform
Security
- Apache Ranger - Central security admin & fine-grained authorization for Hadoop
Internet of things and sensor data
- Azure IoT Hub - Cloud-based bi-directional monitoring and messaging hub
Feb 20 - Feb 26, 2017
Distributed Programming
- Rackerlabs Blueflood - multi-tenant distributed metric processing system
Key Map Data Model
- ScyllaDB - column-oriented distributed datastore written in C++, totally compatible with Apache Cassandra.
Graph Data Model
- GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
Benchmarking
- Yahoo Gridmix3 - Hadoop cluster benchmarking from Yahoo's engineering team.
Data Visualization
- Lumify - open source big data analysis and visualization platform
Feb 06 - Feb 12, 2017
Frameworks
- Pachyderm - Pachyderm is a data storage platform built on Docker and Kubernetes to provide reproducible data processing and analysis.
Jan 30 - Feb 05, 2017
Service Programming
- Hydrosphere Mist (⭐323) - a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
Jan 23 - Jan 29, 2017
Key-value Data Model
- Edis (⭐467) - is a protocol-compatible Server replacement for Redis.
Data Ingestion
- Kestrel (⭐6) - distributed message queue system.
Dec 19 - Dec 25, 2016
Time-Series Databases
- Beringei (⭐3.2k) - Facebook's in-memory time-series database.
Nov 21 - Nov 27, 2016
Distributed Programming
- Skale (⭐397) - High performance distributed data processing in NodeJS.
Nov 14 - Nov 20, 2016
Applications
- Rakam (⭐796) - open-source real-time custom analytics platform powered by Postgresql, Kinesis and PrestoDB.
Oct 24 - Oct 30, 2016
Distributed Filesystem
- Baidu File System (⭐2.8k) - distributed filesystem.
Key Map Data Model
- Baidu Tera (⭐1.9k) - an Internet-scale database, inspired by BigTable.
Key-value Data Model
- SummitDB (⭐1.4k) - an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
- Tile38 (⭐8.6k) - a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
Graph Data Model
- EliasDB (⭐965) - a lightweight graph based database that does not require any third-party libraries.
NewSQL Databases
- Bedrock - a simple, modular, networked and distributed transaction layer built atop SQLite.
Applications
- 411 (⭐973) - a web application for alert management resulting from scheduled searches into Elasticsearch.
- Atlas (⭐3.3k) - a backend for managing dimensional time series data.
Oct 17 - Oct 23, 2016
Machine Learning
- DataVec - A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
- Deeplearning4j - Fast, open deep learning for the JVM (Java, Scala, Clojure). A neural network configuration layer powered by a C++ library. Uses Spark and Hadoop to train nets on multiple GPUs and CPUs.
- H2O (⭐6.3k) - statistical, machine learning and math runtime with Hadoop. R and Python.
- Keras (⭐58k) - An intuitive neural net API inspired by Torch that runs atop Theano and Tensorflow.
- Mahout - An Apache-backed machine learning library for Hadoop.
- ND4J - A matrix library for the JVM. Numpy for Java.
- RL4J - Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with Open AI's Gym. Runs in the Deeplearning4j ecosystem.
- Sibyl - System for Large Scale Machine Learning at Google.
- TensorFlow (⭐175k) - Library from Google for machine learning using data flow graphs.
- Theano - A Python-focused machine learning library supported by the University of Montreal.
- Torch - A deep learning library with a Lua API, supported by NYU and Facebook.
- Velox (⭐109) - System for serving machine learning predictions.
Benchmarking
Sep 26 - Oct 02, 2016
Time-Series Databases
- Druid (⭐13k) - Column oriented distributed data store ideal for powering interactive applications
- Riak-TS - Riak TS is the only enterprise-grade NoSQL time series database optimized specifically for IoT and Time Series data.
- Akumuli (⭐818) - Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
- Rhombus - A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
- Dalmatiner DB (⭐699) - Fast distributed metrics database
- Blueflood (⭐591) - A distributed system designed to ingest and process time series data
- Timely (⭐367) - Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
Sep 19 - Sep 25, 2016
Interesting Readings
- Monitoring Kafka performance - Guide to monitoring Apache Kafka, including native methods for metrics collection.
- Monitoring Hadoop performance - Guide to monitoring Hadoop, with an overview of Hadoop architecture, and native methods for metrics collection.
Sep 12 - Sep 18, 2016
Books / Streaming
- Streaming Data - Streaming Data introduces the concepts and requirements of streaming and real-time data systems.
- Storm Applied - Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
- Fundamentals of Stream Processing: Application Design, Systems, and Analytics - This comprehensive, hands-on guide combining the fundamental building blocks and emerging research in stream processing is ideal for application designers, system builders, analytic developers, as well as students and researchers in the field.
- Stream Data Processing: A Quality of Service Perspective - Presents a new paradigm suitable for stream and complex event processing.
Aug 29 - Sep 04, 2016
SQL-like processing
- Apache Calcite - framework that allows efficient translation of queries involving heterogeneous and federated data.
Aug 15 - Aug 21, 2016
Distributed Filesystem
- Ambry (⭐1.6k) - a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.
Key-value Data Model
- Bolt (⭐14k) - an embedded key-value database for Go.
- BuntDB (⭐4.1k) - a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
- HyperDex (⭐1.4k) - a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
Columnar Databases
- ClickHouse - an open-source column-oriented database management system that allows generating analytical data reports in real time.
- EventQL - a distributed, column-oriented database built for large-scale event collection and analytics.
Applications
- ElastAlert (⭐7.9k) - ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
- Kapacitor (⭐2.2k) - an open source framework for processing, monitoring, and alerting on time series data.
Data Visualization
- ReCharts - A composable charting library built on React components
Jun 27 - Jul 03, 2016
Time-Series Databases
- Chronix - a time series store built to keep time series highly compressed and to provide fast access times.
Jun 20 - Jun 26, 2016
Distributed Programming
- Apache Gearpump - real-time big data streaming engine based on Akka.
May 30 - Jun 05, 2016
Time-Series Databases
- Cube - uses MongoDB to store time series data.
- Newts - a time series database based on Apache Cassandra.
- TrailDB - an efficient tool for storing and querying series of events.
Data Visualization
- AnyChart - fast, simple and flexible JavaScript (HTML5) charting library featuring pure JS API.
May 23 - May 29, 2016
Distributed Programming
- Twitter Heron (⭐3.6k) - Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.
Time-Series Databases
- KairosDB (⭐1.7k) - similar to OpenTSDB but allows for Cassandra.
Machine Learning
- MOA - MOA performs big data stream mining in real time, and large scale machine learning.
Data Visualization
- Bloomery (⭐16) - Web UI for Impala.
May 16 - May 22, 2016
Distributed Programming
- Apache APEX - a unified, enterprise platform for big data stream and batch processing.
Apr 18 - Apr 24, 2016
Data Visualization
- chartd - responsive, retina-compatible charts with just an img tag.
Apr 04 - Apr 10, 2016
Key-value Data Model
- TiKV (⭐13k) - a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
Mar 28 - Apr 03, 2016
Distributed Programming
- Netflix PigPen (⭐547) - map-reduce for Clojure which compiles to Apache Pig.
- Streamdrill - useful for counting activities of event streams over different time windows and finding the most active one.
Mar 21 - Mar 27, 2016
Graph Data Model
- DGraph (⭐19k) - A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
Mar 14 - Mar 20, 2016
Distributed Filesystem
- Alluxio - reliable file sharing at memory speed across cluster frameworks.
Mar 07 - Mar 13, 2016
2015 - 2016
- 2015 - Facebook - One Trillion Edges: Graph Processing at Facebook-Scale.
Feb 29 - Mar 06, 2016
Data Ingestion
- Skizze (⭐775) - sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
Data Visualization
- Shiny - a web application framework for R.
Feb 22 - Feb 28, 2016
Key-value Data Model
- GridDB (⭐2.1k) - suitable for sensor data stored in a timeseries.
Applications
- SnappyData (⭐1k) - a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
Jan 18 - Jan 24, 2016
Distributed Programming
- Tuktu (⭐58) - Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
Document Data Model
- RavenDB - A transactional, open-source Document Database.
Key Map Data Model
- Hypertable - column-oriented distributed datastore, inspired by BigTable.
Applications
- Countly - open source mobile and web analytics platform, based on Node.js & MongoDB.
- Kylin - open source Distributed Analytics Engine from eBay.
Data Visualization
- Redash (⭐23k) - open-source platform to query and visualize data.
Dec 21 - Dec 27, 2015
Data Visualization
- D3.compose (⭐698) - Compose complex, data-driven visualizations from reusable charts and components.
Dec 14 - Dec 20, 2015
Machine Learning
- BIDMach (⭐914) - CPU and GPU-accelerated Machine Learning Library.
Dec 07 - Dec 13, 2015
RDBMS
- Oracle Database - object-relational database management system.
Nov 30 - Dec 06, 2015
Distributed Programming
- Apache REEF - retainable evaluator execution framework to simplify and unify the lower layers of big data systems.
Time-Series Databases
- Heroic - is a scalable time series database based on Cassandra and Elasticsearch.
Nov 23 - Nov 29, 2015
Data Visualization
- Plotly.js (⭐16k) - The open source JavaScript graphing library that powers plotly.
Nov 16 - Nov 22, 2015
Distributed Programming
- Apache Flink - high-performance runtime, and automatic program optimization.
- Apache Spark - framework for in-memory cluster computing.
- Apache Storm - framework for stream processing by Twitter, also on YARN.
- Apache Samza - stream processing framework, based on Kafka and YARN.
- Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
- Pydoop - Python MapReduce and HDFS API for Hadoop.
Distributed Filesystem
- Lustre file system - high-performance distributed filesystem.
- Quantcast File System QFS - open-source distributed file system.
Key Map Data Model
- Google Cloud Datastore - is a fully managed, schemaless database for storing non-relational data over BigTable.
- Tephra (⭐157) - Transactions for HBase.
Key-value Data Model
- EventStore - distributed time series database.
Graph Data Model
- Apache Spark Bagel - implementation of Pregel, part of Spark.
- ArangoDB - multi model distributed database.
- MapGraph - Massively Parallel Graph processing on GPUs.
- OrientDB - document and graph database.
Columnar Databases
- Parquet - columnar storage format for Hadoop.
- Vertica - is designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.
NewSQL Databases
- CitusDB - scales out PostgreSQL through sharding and replication.
- HandlerSocket - NoSQL plugin for MySQL/MariaDB.
SQL-like processing
- Apache Drill - framework for interactive analysis, inspired by Dremel.
- Apache Phoenix - SQL skin over HBase.
- Concurrent Lingual - SQL-like query language for Cascading.
- Facebook PrestoDB - distributed SQL query engine.
- SparkSQL - Manipulating Structured Data Using Spark.
- Tajo - distributed data warehouse system on Hadoop.
Data Ingestion
- Apache Chukwa - data collection system.
- Facebook Scribe (⭐3.9k) - streamed log data aggregator.
- Fluentd - tool to collect events and logs.
- Logstash - a tool for managing events and logs.
Service Programming
- Twitter Elephant Bird (⭐1.1k) - libraries for working with LZOP-compressed data.
Scheduling
- Apache Aurora - is a service scheduler that runs on top of Apache Mesos.
- Apache Falcon - data management framework.
- Schedoscope (⭐95) - Scala DSL for agile scheduling of Hadoop jobs.
Machine Learning
- Concurrent Pattern - machine learning library for Cascading.
- ENCOG - machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
- GraphLab Create - A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
- SAMOA - distributed streaming machine learning framework.
Applications
- Imhotep - Large scale analytics platform by Indeed.
- PivotalR (⭐119) - R on Pivotal HD / HAWQ and PostgreSQL.
- Qubole - auto-scaling Hadoop cluster, built-in data connectors.
Search engine and framework
- ElasticSearch - Search and analytics engine based on Apache Lucene.
- Google Caffeine - continuous indexing system.
MySQL forks and evolutions
- Percona Server - enhanced, drop-in replacement for MySQL.
- TokuDB - TokuDB is a storage engine for MySQL and MariaDB.
Embedded Databases
- LevelDB (⭐33k) - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
Business Intelligence
- Tableau - business intelligence platform.
Data Visualization
- Bokeh - A powerful Python interactive visualization library that targets modern web browsers for presentation, with the goal of providing elegant, concise construction of novel graphics in the style of D3.js, but also delivering this capability with high-performance interactivity over very large or streaming datasets.
- Kibana - visualize logs and time-stamped data
- Plot.ly - Easy-to-use web service that allows for rapid creation of complex charts, from heatmaps to histograms. Upload data to create and style charts with Plotly's online spreadsheet. Fork others' plots.
Internet of things and sensor data
- TempoIQ - Cloud-based sensor analytics.
- PubNub - Data stream network
2001 - 2010
- 2010 - Yahoo - S4: Distributed Stream Computing Platform.
Oct 19 - Oct 25, 2015
Search engine and framework
- Sphinx Search Server - fulltext search engine.
Oct 05 - Oct 11, 2015
Data Ingestion
- StreamSets Data Collector - continuous big data ingest infrastructure with a simple to use IDE.
Sep 28 - Oct 04, 2015
Key Map Data Model
- Apache Accumulo - distributed key/value store, built on Hadoop.
- Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
- Apache HBase - column-oriented distributed datastore, inspired by BigTable.
Sep 14 - Sep 20, 2015
NewSQL Databases
- TiDB (⭐34k) - TiDB is a distributed SQL database. Inspired by the design of Google F1.
Sep 07 - Sep 13, 2015
Applications
- Hermes (⭐755) - asynchronous message broker built on top of Kafka.
Aug 17 - Aug 23, 2015
Distributed Filesystem
- Seaweed-FS (⭐17k) - simple and highly scalable distributed file system.
Jul 06 - Jul 12, 2015
Key Map Data Model
- InfiniDB (⭐246) - is accessed through a MySQL interface and uses massive parallel processing to parallelize queries.
NewSQL Databases
- Pivotal GemFire XD - Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
Scheduling
- Chronos - distributed and fault-tolerant scheduler.
- LinkedIn Azkaban - batch workflow job scheduler.
System Deployment
- Apache Slider (⭐77) - is a YARN application to deploy existing distributed applications on YARN.
Applications
- Domino - Run, scale, share, and deploy models — without any infrastructure.
Data Visualization
- CartoDB (⭐2.7k) - open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
- Crossfilter - JavaScript library for exploring large multivariate datasets in the browser. Works well with dc.js and d3.js.
- Gephi (⭐5.3k) - An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
Apr 27 - May 03, 2015
Distributed Filesystem
- Tahoe-LAFS - decentralized cloud storage system.
Apr 20 - Apr 26, 2015
Data Ingestion
- LinkedIn Gobblin (⭐2.1k) - LinkedIn's universal data ingestion framework.
Apr 06 - Apr 12, 2015
Frameworks
- Tigon (⭐279) - High Throughput Real-time Stream Processing Framework.
Mar 23 - Mar 29, 2015
Distributed Filesystem
- Google GFS - distributed filesystem.
Mar 09 - Mar 15, 2015
Data Visualization
- Airpal (⭐2.8k) - Web UI for PrestoDB.
Mar 02 - Mar 08, 2015
Distributed Programming
- Metamarkets Druid - framework for real-time analysis of large datasets.
Jan 19 - Jan 25, 2015
Data Visualization
- D3Plus - A fairly robust set of reusable charts and styles for d3.js.
Jan 12 - Jan 18, 2015
Data Visualization
- Echarts (⭐55k) - Baidu's enterprise charts.
Jan 05 - Jan 11, 2015
Internet of things and sensor data
- IFTTT - If this then that
- Evrything - Making products smart
Nov 24 - Nov 30, 2014
Data Visualization
- Banana (⭐672) - visualize logs and time-stamped data stored in Solr. Port of Kibana.
Oct 27 - Nov 02, 2014
Data Visualization / Graph Based approach
Oct 20 - Oct 26, 2014
Data Visualization
- C3 - D3-based reusable chart library
Internet of things and sensor data
- 2lemetry - Platform for Internet of things.
Sep 08 - Sep 14, 2014
Data Visualization
- Chartist.js (⭐43) - another open source HTML5 Charts visualization.
Aug 25 - Aug 31, 2014
Distributed Filesystem
- Disco DDFS - distributed filesystem.
Key-value Data Model
- Aerospike - NoSQL flash-optimized, in-memory. Open source and "Server code in 'C' (not Java or Erlang) precisely tuned to avoid context switching and memory copies."
- TreodeDB (⭐178) - key-value store that's replicated and sharded and provides atomic multirow writes.
Applications
- Adobe spindle (⭐333) - Next-generation web analytics processing with Scala, Spark, and Parquet.
Data Visualization / Graph Based approach
- Other awesome lists awesome-awesomeness (⭐30k).
- Even more lists awesome (⭐254k).
- Another list? list (⭐9k).
- Analytics awesome-analytics (⭐3.6k).
Aug 18 - Aug 24, 2014
Data Visualization
- DC.js - Dimensional charting built to work natively with crossfilter rendered using d3.js. Excellent for connecting charts/additional metadata to hover events in D3.
- IPython - provides a rich architecture for interactive computing.
2013 - 2014
- 2014 - Stanford - Mining of Massive Datasets.
Aug 11 - Aug 17, 2014
Key Map Data Model
- Facebook HydraBase - evolution of HBase made by Facebook.
Columnar Databases
- Columnar Storage - an explanation of what columnar storage is and when you might want it.
- Actian Vector - column-oriented analytic database.
- MonetDB - column store database.
Aug 04 - Aug 10, 2014
Frameworks
- Apache Hadoop - framework for distributed processing. Integrates MapReduce (parallel processing), YARN (job scheduling) and HDFS (distributed file system).
Distributed Programming
- AddThis Hydra (⭐438) - distributed data processing and storage system originally developed at AddThis.
- Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
- DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
Document Data Model
- Facebook Apollo - Facebook’s Paxos-like NoSQL database.
Key-value Data Model
- Oracle NoSQL Database - distributed key-value database by Oracle Corporation.
Graph Data Model
- Gremlin (⭐1.9k) - graph traversal Language.
- Infovore (⭐148) - RDF-centric Map/Reduce framework.
NewSQL Databases
- Actian Ingres - commercially supported, open-source SQL relational database management system.
- Cockroach (⭐27k) - Scalable, Geo-Replicated, Transactional Datastore.
- Datomic - distributed database designed to enable scalable, flexible and intelligent applications.
- FoundationDB - distributed database, inspired by F1.
- Oracle TimesTen in-Memory Database - in-memory, relational database management system with persistence and recoverability.
Time-Series Databases
- OpenTSDB - distributed time series database on top of HBase.
SQL-like processing
- RainstorDB - database for storing petabyte-scale volumes of structured and semi-structured data.
- Trafodion - enterprise-class SQL-on-HBase solution targeting big data transactional or operational workloads.
Data Ingestion
- Heka (⭐3.4k) - open source stream processing software system.
- LinkedIn White Elephant (⭐190) - log aggregator and dashboard.
Service Programming
- Spotify Luigi (⭐17k) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
Scheduling
- Apache Oozie - workflow job scheduler.
Benchmarking
- PUMA Benchmarking - benchmark suite for MapReduce applications.
Security
- Apache Sentry - security module for data stored in Hadoop.
System Deployment
- Brooklyn - library that simplifies application deployment and management.
Applications
- Apache OODT - capturing, processing and sharing of data for NASA's scientific archives.
Search engine and framework
- HBase Coprocessor - implementation of Percolator, part of HBase.
- Lily HBase Indexer - quickly and easily search for any content stored in HBase.
PostgreSQL forks and evolutions
- HadoopDB - hybrid of MapReduce and DBMS.
- IBM Netezza - high-performance data warehouse appliances.
- Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
- RecDB - Open Source Recommendation Engine Built Entirely Inside PostgreSQL.
- Stado - open source MPP database system solely targeted at data warehousing and data mart applications.
Memcached forks and evolutions
- Twemproxy (⭐12k) - A fast, light-weight proxy for memcached and redis.
Embedded Databases
- Actian PSQL - ACID-compliant DBMS developed by Pervasive Software, optimized for embedding in applications.
Data Visualization
- Cytoscape - JavaScript library for visualizing complex networks.
- Freeboard (⭐6.4k) - open source real-time dashboard builder for IoT and other web mashups.
- Google Charts - simple charting API.
- Peity (⭐4.2k) - Progressive SVG bar, line and pie charts.
2011 - 2012
- 2012 - Twitter - The Unified Logging Infrastructure for Data Analytics at Twitter.
Jul 21 - Jul 27, 2014
Business Intelligence
- Chartio - lean business intelligence platform to visualize and explore your data.
Jul 14 - Jul 20, 2014
Distributed Programming
- AMPLab SIMR - run Spark on Hadoop MapReduce v1.
- Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
- Apache Gora - framework for in-memory data model and persistence.
- Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
- Apache Pig - high level language to express data analysis programs for Hadoop.
- Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
- Cascalog - data processing and querying library.
- Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
- Concurrent Cascading - framework for data management/analytics on Hadoop.
- Damballa Parkour (⭐258) - MapReduce library for Clojure.
- Datasalt Pangool (⭐57) - alternative MapReduce paradigm.
- Facebook Corona - Hadoop enhancement which removes single point of failure.
- Facebook Peregrine - Map Reduce framework.
- Facebook Scuba - distributed in-memory datastore.
- JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
- Kite - is a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
- Nokia Disco - MapReduce framework developed by Nokia.
- Stratosphere - general purpose cluster computing framework.
- Twitter Scalding (⭐3.4k) - Scala library for Map Reduce jobs, built on Cascading.
- Twitter Summingbird (⭐2.1k) - Streaming MapReduce with Scalding and Storm, by Twitter.
Distributed Filesystem
- Apache HDFS - a way to store large files across multiple machines.
- Ceph Filesystem - software storage platform.
- Facebook Haystack - object storage system.
Document Data Model
- Crate Data - is an open source massively scalable data store. It requires zero administration.
- jumboDB - document oriented datastore over Hadoop.
- MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
Key-value Data Model
- ElephantDB (⭐553) - Distributed database specialized in exporting data from Hadoop.
- LinkedIn Krati (⭐26) - is a simple persistent data store with very low latency and high throughput.
- LinkedIn Voldemort - distributed key/value storage system.
- Riak (⭐3.8k) - a decentralized datastore.
- Storehaus (⭐466) - library to work with asynchronous key value stores, by Twitter.
- Tarantool (⭐3.2k) - an efficient NoSQL database and a Lua application server.
Graph Data Model
- Apache Giraph - implementation of Pregel, based on Hadoop.
- Facebook TAO - TAO is the distributed data store that is widely used at Facebook to store and serve the social graph.
- Google Pregel - graph processing framework.
- GraphX - resilient Distributed Graph System on Spark.
- Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
- Phoebus (⭐382) - framework for large scale graph processing.
- Titan - distributed graph database, built over Cassandra.
NewSQL Databases
- Amazon RedShift - data warehouse service, based on PostgreSQL.
- H-Store - is an experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
- Haeinsa (⭐157) - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
- InfiniSQL - infinitely scalable RDBMS.
- MemSQL - in-memory SQL database with optimized columnar storage on flash.
- NuoDB - SQL/ACID compliant distributed database.
- Sky - database used for flexible, high performance analysis of behavioral data.
- SymmetricDS - open source software for both file and database synchronization.
SQL-like processing
- Apache Hive - SQL-like data warehouse system for Hadoop.
- Datasalt Splout SQL - full SQL query engine for big datasets.
- Spark Catalyst (⭐36k) - is a Query Optimization Framework for Spark and Shark.
Data Ingestion
- Apache Flume - service to manage large amount of log data.
- Apache Kafka - distributed publish-subscribe messaging system.
- Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
- HIHO (⭐90) - framework for connecting disparate data sources with Hadoop.
- LinkedIn Kamikaze (⭐22) - utility package for compressing sorted integer arrays.
- Netflix Suro (⭐777) - log aggregator like Storm and Samza based on Chukwa.
- Pinterest Secor (⭐1.8k) - is a service implementing Kafka log persistence.
Service Programming
- Akka Toolkit - runtime for distributed, and fault tolerant event-driven applications on the JVM.
- Apache Avro - data serialization system.
- Apache Curator - Java libraries for Apache ZooKeeper.
- Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
- Apache Thrift - framework to build binary protocols.
- Apache Zookeeper - centralized service for process management.
- Spring XD (⭐479) - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
- Twitter Finagle - asynchronous network stack for the JVM.
Scheduling
- Sparrow (⭐314) - scheduling platform.
Machine Learning
- brain (⭐8k) - Neural networks in JavaScript.
- convnetjs (⭐11k) - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
- Decider (⭐386) - Flexible and Extensible Machine Learning in Ruby.
- etcML - text classification with machine learning.
- Etsy Conjecture (⭐358) - scalable Machine Learning in Scalding.
- MLbase - distributed machine learning libraries for the BDAS stack.
- MLPNeuralNet (⭐895) - Fast multilayer perceptron neural network library for iOS and Mac OS X.
- nupic (⭐6.3k) - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
- scikit-learn (⭐54k) - scikit-learn: machine learning in Python.
- Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
- Vowpal Wabbit (⭐8.2k) - learning system sponsored by Microsoft and Yahoo!.
- WEKA - suite of machine learning software.
Benchmarking
- Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
- Berkeley SWIM Benchmark (⭐126) - real-world big data workload benchmark.
- Intel HiBench (⭐1.4k) - a Hadoop benchmark suite.
Security
- Apache Knox Gateway - single point of secure access for Hadoop clusters.
System Deployment
- Apache Ambari - operational framework for Hadoop management.
- Apache Bigtop - system deployment framework for the Hadoop ecosystem.
- Apache Helix - cluster management framework.
- Apache Mesos - cluster manager.
- Apache Whirr - set of libraries for running cloud services.
- Buildoop - Similar to Apache BigTop based on Groovy language.
- Cloudera HUE - web application for interacting with Hadoop.
- Facebook Prism - multi datacenters replication system.
- Google Omega - job scheduling and monitoring system.
- Marathon (⭐4.1k) - Mesos framework for long-running services.
Applications
- Apache Nutch - open source web crawler.
- Apache Tika - content analysis toolkit.
- Eclipse BIRT - Eclipse-based reporting system.
- Eventhub (⭐1.3k) - open source event analytics platform.
- Snowplow (⭐6.5k) - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
- SparkR - R frontend for Spark.
Search engine and framework
- Apache Lucene - Search engine library.
- Apache Solr - Search platform for Apache Lucene.
- LinkedIn Bobo - is a Faceted Search implementation written purely in Java, an extension to Apache Lucene.
- LinkedIn Cleo (⭐558) - is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
- LinkedIn Zoie (⭐362) - is a realtime search/indexing system written in Java.
MySQL forks and evolutions
- Drizzle - evolution of MySQL 6.0.
- MariaDB - enhanced, drop-in replacement for MySQL.
- ProxySQL (⭐24) - High Performance Proxy for MySQL.
- WebScaleSQL - is a collaboration among engineers from several companies that face similar challenges in running MySQL at scale.
Memcached forks and evolutions
- Facebook McDipper - key/value cache for flash storage.
- Facebook Memcached - fork of Memcached.
- Twitter Fatcache (⭐1.3k) - key/value cache for flash storage.
- Twitter Twemcache (⭐930) - fork of Memcached.
Embedded Databases
- HanoiDB (⭐298) - Erlang LSM BTree Storage.
- RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.
Business Intelligence
- Jaspersoft - powerful business intelligence suite.
- Microsoft - business intelligence software and platform.
- Pentaho - business intelligence platform.
Data Visualization
- Arbor (⭐2.6k) - graph visualization library using web workers and jQuery.
- Chart.js - open source HTML5 Charts visualizations.
- Cubism (⭐4.9k) - JavaScript library for time series visualization.
- Envisionjs (⭐1.6k) - dynamic HTML5 visualization.
- Matplotlib (⭐17k) - plotting with Python.
- NVD3 - chart components for d3.js.
- Recline (⭐2.1k) - simple but powerful library for building data applications in pure JavaScript and HTML.
- Sigma.js (⭐11k) - JavaScript library dedicated to graph drawing.
Interesting Readings
- Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
2013 - 2014
- 2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
- 2013 - AMPLab - MLbase: A Distributed Machine-learning System.
- 2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.
- 2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
- 2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
- 2013 - Metamarkets - Druid: A Real-time Analytical Data Store.
- 2013 - Google - Online, Asynchronous Schema Change in F1.
- 2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
- 2013 - Facebook - Scuba: Diving into Data at Facebook.
- 2013 - Facebook - Unicorn: A System for Searching the Social Graph.
- 2013 - Facebook - Scaling Memcache at Facebook.
2011 - 2012
- 2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.
- 2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
- 2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
- 2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
- 2012 - Microsoft - Paxos Made Parallel.
- 2012 - Google - Processing a trillion cells per mouse click.
- 2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
- 2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
2001 - 2010
- 2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage.
- 2010 - AMPLab - Spark: Cluster Computing with Working Sets.
- 2010 - Google - Pregel: A System for Large-Scale Graph Processing.
- 2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.
- 2006 - Google - Bigtable: A Distributed Storage System for Structured Data.