Designing Data-Intensive Applications

Chapter 10 References

  1. Jeffrey Dean and Sanjay Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters,” at 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), December 2004.

  2. Joel Spolsky: “The Perils of JavaSchools,” December 25, 2005.

  3. Shivnath Babu and Herodotos Herodotou: “Massively Parallel Databases and MapReduce Systems,” Foundations and Trends in Databases, volume 5, number 1, pages 1–104, November 2013. doi:10.1561/1900000036

  4. David J. DeWitt and Michael Stonebraker: “MapReduce: A Major Step Backwards,” January 17, 2008.

  5. Henry Robinson: “The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google,” June 25, 2014.

  6. “The Hollerith Machine,” United States Census Bureau.

  7. “IBM 82, 83, and 84 Sorters Reference Manual,” Edition A24-1034-1, International Business Machines Corporation, July 1962.

  8. Adam Drake: “Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster,” January 25, 2014.

  9. “GNU Coreutils 8.23 Documentation,” Free Software Foundation, Inc., 2014.

  10. Martin Kleppmann: “Kafka, Samza, and the Unix Philosophy of Distributed Data,” August 5, 2015.

  11. Doug McIlroy: Internal Bell Labs memo, October 1964. Cited in: Dennis M. Ritchie: “Advice from Doug McIlroy.”

  12. M. D. McIlroy, E. N. Pinson, and B. A. Tague: “UNIX Time-Sharing System: Foreword,” The Bell System Technical Journal, volume 57, number 6, pages 1899–1904, July 1978.

  13. Eric S. Raymond: The Art of UNIX Programming. Addison-Wesley, 2003. ISBN: 978-0-13-142901-7

  14. Ronald Duncan: “Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text,” October 31, 2009.

  15. Alan Kay: “Is 'Software Engineering' an Oxymoron?”

  16. Martin Fowler: “InversionOfControl,” June 26, 2005.

  17. Daniel J. Bernstein: “Two File Descriptors for Sockets.”

  18. Rob Pike and Dennis M. Ritchie: “The Styx Architecture for Distributed Systems,” Bell Labs Technical Journal, volume 4, number 2, pages 146–152, April 1999.

  19. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “The Google File System,” at 19th ACM Symposium on Operating Systems Principles (SOSP), October 2003. doi:10.1145/945445.945450

  20. Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “The Quantcast File System,” Proceedings of the VLDB Endowment, volume 6, number 11, pages 1092–1101, August 2013. doi:10.14778/2536222.2536234

  21. “OpenStack Swift 2.6.1 Developer Documentation,” OpenStack Foundation, March 2016.

  22. Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “Introduction to HDFS Erasure Coding in Apache Hadoop,” September 23, 2015.

  23. Peter Cnudde: “Hadoop Turns 10,” February 5, 2016.

  24. Eric Baldeschwieler: “Thinking About the HDFS vs. Other Storage Technologies,” July 25, 2012.

  25. Brendan Gregg: “Manta: Unix Meets Map Reduce,” June 25, 2013.

  26. Tom White: Hadoop: The Definitive Guide, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2

  27. Jim N. Gray: “Distributed Computing Economics,” Microsoft Research Tech Report MSR-TR-2003-24, March 2003.

  28. Márton Trencséni: “Luigi vs Airflow vs Pinball,” February 6, 2016.

  29. Roshan Sumbaly, Jay Kreps, and Sam Shah: “The 'Big Data' Ecosystem at LinkedIn,” at ACM International Conference on Management of Data (SIGMOD), July 2013. doi:10.1145/2463676.2463707

  30. Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience,” at 35th International Conference on Very Large Data Bases (VLDB), August 2009.

  31. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “Hive – A Petabyte Scale Data Warehouse Using Hadoop,” at 26th IEEE International Conference on Data Engineering (ICDE), March 2010. doi:10.1109/ICDE.2010.5447738

  32. “Cascading 3.0 User Guide,” Concurrent, Inc., January 2016.

  33. “Apache Crunch User Guide,” Apache Software Foundation.

  34. Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” at 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2010. doi:10.1145/1806596.1806638

  35. Jay Kreps: “Why Local State is a Fundamental Primitive in Stream Processing,” July 31, 2014.

  36. Martin Kleppmann: “Rethinking Caching in Web Apps,” October 1, 2012.

  37. Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: Hadoop Application Architectures. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8

  38. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “Challenges to Adopting Stronger Consistency at Scale,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

  39. Sriranjan Manjunath: “Skewed Join,” 2009.

  40. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “Practical Skew Handling in Parallel Joins,” at 18th International Conference on Very Large Data Bases (VLDB), August 1992.

  41. Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “Impala: A Modern, Open-Source SQL Engine for Hadoop,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

  42. Matthieu Monsch: “Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data,” October 26, 2015.

  43. Daniel Peng and Frank Dabek: “Large-Scale Incremental Processing Using Distributed Transactions and Notifications,” at 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2010.

  44. “Cloudera Search User Guide,” Cloudera, Inc., September 2015.

  45. Lili Wu, Sam Shah, Sean Choi, et al.: “The Browsemaps: Collaborative Filtering at LinkedIn,” at 6th Workshop on Recommender Systems and the Social Web (RSWeb), October 2014.

  46. Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “Serving Large-Scale Batch Computed Data with Project Voldemort,” at 10th USENIX Conference on File and Storage Technologies (FAST), February 2012.

  47. Varun Sharma: “Open-Sourcing Terrapin: A Serving System for Batch Generated Data,” September 14, 2015.

  48. Nathan Marz: “ElephantDB,” May 30, 2011.

  49. Jean-Daniel (JD) Cryans: “How-to: Use HBase Bulk Loading, and Why,” September 27, 2013.

  50. Nathan Marz: “How to Beat the CAP Theorem,” October 13, 2011.

  51. Molly Bartlett Dishman and Martin Fowler: “Agile Architecture,” at O'Reilly Software Architecture Conference, March 2015.

  52. David J. DeWitt and Jim N. Gray: “Parallel Database Systems: The Future of High Performance Database Systems,” Communications of the ACM, volume 35, number 6, pages 85–98, June 1992. doi:10.1145/129888.129894

  53. Jay Kreps: “But the multi-tenancy thing is actually really really hard,” tweetstorm, October 31, 2014.

  54. Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “MAD Skills: New Analysis Practices for Big Data,” Proceedings of the VLDB Endowment, volume 2, number 2, pages 1481–1492, August 2009. doi:10.14778/1687553.1687576

  55. Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “Data Wrangling: The Challenging Journey from the Wild to the Lake,” at 7th Biennial Conference on Innovative Data Systems Research (CIDR), January 2015.

  56. Paige Roberts: “To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question,” July 2, 2015.

  57. Bobby Johnson and Joseph Adler: “The Sushi Principle: Raw Data Is Better,” at Strata+Hadoop World, February 2015.

  58. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “Apache Hadoop YARN: Yet Another Resource Negotiator,” at 4th ACM Symposium on Cloud Computing (SoCC), October 2013. doi:10.1145/2523616.2523633

  59. Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “Large-Scale Cluster Management at Google with Borg,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741964

  60. Malte Schwarzkopf: “The Evolution of Cluster Scheduler Architectures,” March 9, 2016.

  61. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” at 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), April 2012.

  62. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: Learning Spark. O'Reilly Media, 2015. ISBN: 978-1-449-35904-1

  63. Bikas Saha and Hitesh Shah: “Apache Tez: Accelerating Hadoop Query Processing,” at Hadoop Summit, June 2014.

  64. Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2742790

  65. Kostas Tzoumas: “Apache Flink: API, Runtime, and Project Roadmap,” January 14, 2015.

  66. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “The Stratosphere Platform for Big Data Analytics,” The VLDB Journal, volume 23, number 6, pages 939–964, May 2014. doi:10.1007/s00778-014-0357-y

  67. Michael Isard, Mihai Budiu, Yuan Yu, et al.: “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks,” at European Conference on Computer Systems (EuroSys), March 2007. doi:10.1145/1272996.1273005

  68. Daniel Warneke and Odej Kao: “Nephele: Efficient Parallel Data Processing in the Cloud,” at 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS), November 2009. doi:10.1145/1646468.1646476

  69. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “The PageRank

  70. Leslie G. Valiant: “A Bridging Model for Parallel Computation,” Communications of the ACM, volume 33, number 8, pages 103–111, August 1990. doi:10.1145/79173.79181

  71. Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “Spinning Fast Iterative Data Flows,” Proceedings of the VLDB Endowment, volume 5, number 11, pages 1268–1279, July 2012. doi:10.14778/2350229.2350245

  72. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “Pregel: A System for Large-Scale Graph Processing,” at ACM International Conference on Management of Data (SIGMOD), June 2010. doi:10.1145/1807167.1807184

  73. Frank McSherry, Michael Isard, and Derek G. Murray: “Scalability! But at What COST?,” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS), May 2015.

  74. Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “Musketeer: All for One, One for All in Data Processing Systems,” at 10th European Conference on Computer Systems (EuroSys), April 2015. doi:10.1145/2741948.2741968

  75. Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “GraphChi: Large-Scale Graph Computation on Just a PC,” at 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), October 2012.

  76. Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “Parallel Graph Analytics,” Communications of the ACM, volume 59, number 5, pages 78–87, May 2016. doi:10.1145/2901919

  77. Fabian Hüske: “Peeking into Apache Flink's Engine Room,” March 13, 2015.

  78. Mostafa Mokhtar: “Hive 0.14 Cost Based Optimizer (CBO) Technical Overview,” March 2, 2015.

  79. Michael Armbrust, Reynold S. Xin, Cheng Lian, et al.: “Spark SQL: Relational Data Processing in Spark,” at ACM International Conference on Management of Data (SIGMOD), June 2015. doi:10.1145/2723372.2742797

  80. Daniel Blazevski: “Planting Quadtrees for Apache Flink,” March 25, 2016.

  81. Tom White: “Genome Analysis Toolkit: Now Using Apache Spark for Data Processing,” April 6, 2016.