MapReduce Concept
MapReduce is natively Java, allowed to interface with Python under Streaming
-
Like dictionary in data structure
image.png -
Shuffle Keys and Sort Values
image.png -
Reducer
image.png -
So overall operation is like:
image.png -
Handle of failure
image.png
MapReduce Coding
-
Problem:
image.png Map Function:
def mapper_get_ratings(self, _, line):
(userID, movieID,rating,Timestamp)=line.split('\t')
yield rating,1
- Reduce Funtion:
def reducer_count_ratings(self, key, values):
yield key, sum(values)
-
After putting them together:
image.png
Installation & Preparation
-
After log into the command line interface:
image.png Run locally:
python RatingsBreakdown.py u.data
- Run with Hadoop:
python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data
-
Result should be :
image.png -
Plus, codes for more complex problems (sorted for movie numbers):
image.png