MapReduce: Simplified Data Processing on Large Clusters
What is MapReduce?
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
...in particular, it “runs on a large cluster of **commodity machines** and is highly scalable.”
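The map/reduce contract described above can be sketched with the classic word-count example. This is a minimal single-process illustration, not Google's implementation: `run_mapreduce`, `map_fn`, `reduce_fn`, and the sample documents are hypothetical names chosen here, and the in-memory grouping step stands in for the distributed shuffle.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_fn(doc_name: str, contents: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in contents.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(
    inputs: dict[str, str],
    mapper: Callable[[str, str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> dict[str, int]:
    """Toy driver: run every map call, group by intermediate key, then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in inputs.items():
        for ikey, ivalue in mapper(key, value):
            groups[ikey].append(ivalue)  # stand-in for the shuffle phase
    # Reduce each key's group; keys are processed in sorted order.
    return {k: reducer(k, vs) for k, vs in sorted(groups.items())}


docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The user only writes `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what a real system would distribute across machines.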
Who built it?
Google, for their search indexing.
Why was MapReduce successful?
Easy to use, expressive (to an extent), scalable implementation.
- It hides the details of parallelization, fault tolerance, locality optimization, and load balancing.
- It is used to generate data for Google’s production web search service, and for sorting, data mining, and machine learning.
- It scales to clusters comprising thousands of machines.
What were the takeaways from MapReduce?
- Restricting the programming model makes it easy to parallelize and distribute computations and to make such computations fault-tolerant.
- Network bandwidth is a scarce resource, so several optimizations aim to reduce the amount of data sent across the network: the locality optimization lets tasks read input from local disks, and writing a single copy of the intermediate data to local disk saves network bandwidth. It is cheaper to send code to the data than to send data to the code.
- Redundant execution can be used to reduce the impact of slow machines, and to handle machine failures and data loss.
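The redundant-execution idea in the last takeaway can be sketched as follows. This is a simplified illustration under stated assumptions: the paper schedules backup executions for the remaining in-progress tasks when a job is close to completion, whereas this sketch uses a per-task timeout to decide when to launch a backup. `square_task`, `run_with_backup`, and the delay values are hypothetical, and threads stand in for worker machines.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def square_task(x: int, simulated_delay: float) -> int:
    """A deterministic task; the sleep simulates a fast or slow worker."""
    time.sleep(simulated_delay)
    return x * x


def run_with_backup(
    pool: ThreadPoolExecutor,
    x: int,
    primary_delay: float,
    backup_delay: float,
    straggler_timeout: float,
) -> tuple[int, str]:
    """Run a task; if the primary is slow, launch a backup and take the first result."""
    primary = pool.submit(square_task, x, primary_delay)
    done, _ = wait([primary], timeout=straggler_timeout)
    if primary in done:
        return primary.result(), "primary"

    # Primary looks like a straggler: start a redundant copy of the same task.
    backup = pool.submit(square_task, x, backup_delay)
    done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
    winner = primary if primary in done else backup
    return winner.result(), ("primary" if winner is primary else "backup")


with ThreadPoolExecutor(max_workers=4) as pool:
    fast = run_with_backup(pool, 7, primary_delay=0.01, backup_delay=0.01,
                           straggler_timeout=0.2)
    slow = run_with_backup(pool, 7, primary_delay=0.5, backup_delay=0.01,
                           straggler_timeout=0.05)

print(fast, slow)
```

Taking whichever copy finishes first is only safe because the task is deterministic, so the primary and the backup produce the same output.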
What is MapReduce good for?
What is it not good for?
- Iterative algorithms
- Interactive queries
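Both perform poorly on MapReduce: every job reads its input from and writes its output to disk, so an iterative algorithm repeats that I/O on each pass, and per-job startup latency is too high for interactive use.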
What influences did MR have on later systems and usage today?
- It influenced DryadLINQ, which in turn inspired Spark.
- It largely pioneered the idea of running large-scale computations on distributed clusters of commodity machines.