dharmeshkakadia/Data-Infra-Projects

Data-Infra-Projects

This is an attempt to list out all the interesting projects.

It is intended for anyone who is designing modern large-scale architectures and needs to choose tools, technologies, or frameworks. The purpose is to help in making those choices with resources such as comparisons, use cases, features, maturity, or really anything that helps in making an informed decision.

Abstractions

Distributed Coordination

These are implementations/libraries that help in writing distributed applications which require some form of coordination.

Infrastructure Management

comparisons

File Systems

Distributed Databases

Infrastructure Logging/Monitoring

Infrastructure Helpers

MultiCloud/CrossCloud utilities

Virtualization

Virtualization++

Generalized Data Processing

comparisons

  • Tez vs Dryad
  • Hadoop vs Spark - Too many differences, no good link.

Large-scale Distributed ML

pub-sub / messaging

Data Ingest

Data change management

Graph Storing and/or Processing

SQL Engines

Stream Processing

Security

Performance Analysis

Workflow engines/DAG-executors/Pipelines

Comparisons

Configuration Management

Service Discovery

Comparison

Testing

Visualization

Libraries

  • Zoie
  • Norbert - cluster manager and networking layer built on top of ZooKeeper.
  • Okapi - Large-scale ML & graph analytics on Giraph
  • Scalding - A Scala API for Cascading
  • SummingBird - Streaming MapReduce with Scalding and Storm
  • Curator - set of Java libraries that make using Apache ZooKeeper much easier
  • Turbine - Low latency high throughput aggregator for real time streams
  • DataFu - Collection of MapReduce libraries
  • Twill (previously known as Weave) - library for writing YARN applications

Search

others

  • Nutch - web crawler
  • Ambari - Hadoop Deployment + Management
  • Bigtop - Hadoop Packaging
  • Skuld
  • Camus - LinkedIn's Kafka to HDFS pipeline.
  • Kiji - collect, analyze and serve data in real time on Apache Hadoop and HBase
