What is MapReduce?
Parallel programming model for big data processing:
split the data into chunks
define steps to process the chunks
process the chunks in parallel
Hadoop is a platform that implements MapReduce.
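The split / define / process-in-parallel idea above can be sketched on a single machine. This is only an illustration, not how Hadoop itself is implemented: the names split_into_chunks, process_chunk and run_parallel are hypothetical, the per-chunk step (a sum) is an arbitrary choice, and a thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Step 1: split the data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Step 2: the processing step defined for each chunk (assumed here: a sum)."""
    return sum(chunk)


def run_parallel(data, chunk_size=3, workers=4):
    """Step 3: process the chunks in parallel, then combine the partial results."""
    chunks = split_into_chunks(data, chunk_size)
    # Threads are only a stand-in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process_chunk, chunks))
    return sum(partial_results)
```

Calling run_parallel(list(range(10))) splits the ten numbers into four chunks, sums each chunk independently, and adds the partial sums together.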
1. Map
<key1, value1> -> <key2, value2>
e.g. <line#, text string> -> <word, count>
After mapping, the output is passed to the Reduce phase.
2. Reduce
Merge/reduce the output of the Map phase. The Reduce phase is optional.
The output of MapReduce can be printed, summed, counted, loaded into a database, or sent to the next MapReduce job.
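The word-count example from the Map step can be traced end to end in plain Python. This is a minimal sketch that runs in one process, assuming whitespace tokenization and lowercasing; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical. The grouping step in the middle is done by the framework in a real job and is written out here only so the sketch is complete.

```python
from collections import defaultdict


def map_phase(line_no, text):
    """Map: <line#, text string> -> list of <word, 1> pairs."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: <word, [1, 1, ...]> -> <word, total>."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over a list of text lines."""
    mapped = []
    for line_no, text in enumerate(lines, start=1):
        mapped.extend(map_phase(line_no, text))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(mapped).items())
```

The resulting dictionary is the kind of output that could then be printed, loaded into a database, or used as input to a following job.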
Idea: MapReduce, massive unstructured data storage
Physical: Java classes for MapReduce and the Hadoop Distributed File System (HDFS)
Hadoop Operational Modes
Java MapReduce Mode: reads records incrementally
Streaming Mode: any language; input can be a line or a stream
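In Streaming Mode the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, which is why any language can be used. The sketch below shows a word-count mapper and reducer in that style. It assumes the reducer receives its input sorted by key, as Hadoop Streaming provides; the function names are hypothetical, and both functions take an iterable of lines so they can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

In an actual streaming job each function would be fed sys.stdin and its results printed line by line; locally, sorted(mapper(lines)) imitates the sort that happens between the two phases.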
Query Languages for Hadoop
These build on core Hadoop to enhance the development and manipulation of Hadoop clusters.
Pig: data flow language and execution environment
Hive (HiveQL): query language based on SQL for building MapReduce jobs
HBase: column-oriented database
Pig (data flow language: Pig Latin)
2 Execution environment modes:
Local file system
MapReduce in a Hadoop environment
Suitable for large dataset and batch processing