Hadoop - MapReduce

MapReduce Concept

MapReduce is native to Java, but it can interface with Python through Hadoop Streaming.

  • Mapper output is a set of key-value pairs, like a dictionary data structure


    image.png
  • Shuffle and Sort: group the values by key and sort the keys


    image.png
  • Reducer


    image.png
  • So overall operation is like:


    image.png
  • Failure handling


    image.png
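
The three phases above can be imitated in plain Python to make the data flow concrete. This is only a minimal single-process sketch, not how Hadoop itself is implemented: the (userID, movieID) records, the function names, and the "movies rated per user" question are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical input records: (userID, movieID) pairs, invented for illustration.
RECORDS = [(196, 242), (186, 302), (196, 377), (244, 51), (186, 265), (196, 51)]

def map_phase(records):
    """Mapper: turn each input record into a (key, value) pair."""
    for user_id, movie_id in records:
        yield user_id, movie_id

def shuffle_and_sort(pairs):
    """Shuffle: collect all values that share a key. Sort: order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reducer: collapse each key's list of values into a single result."""
    for key, values in grouped:
        yield key, len(values)

grouped = shuffle_and_sort(map_phase(RECORDS))
result = list(reduce_phase(grouped))
print(grouped)  # one (key, [values]) entry per user
print(result)   # one (key, count) entry per user
```

On a real cluster the mapper and reducer calls are spread across many machines, and the framework performs the shuffle between them.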

MapReduce Coding

  • Problem:


    image.png
  • Map Function:

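def mapper_get_ratings(self, _, line):
    # Each input line is tab-separated: userID, movieID, rating, timestamp
    (userID, movieID, rating, timestamp) = line.split('\t')
    # Emit the rating as the key, with a count of 1 as the value
    yield rating, 1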

  • Reduce Function:

def reducer_count_ratings(self, key, values):
    yield key, sum(values)

  • After putting them together:


    image.png
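
To check the map and reduce logic without Hadoop, the two functions can be driven by hand. The sketch below copies them as plain functions, passing None for the unused self argument; in the actual script they are presumably methods of a job class, which is an assumption since the original screenshot is not available. The run_locally helper and the sample lines are hypothetical, written in the tab-separated layout the mapper expects.

```python
from collections import defaultdict

def mapper_get_ratings(self, _, line):
    (userID, movieID, rating, timestamp) = line.split('\t')
    yield rating, 1

def reducer_count_ratings(self, key, values):
    yield key, sum(values)

# Hypothetical sample lines: userID, movieID, rating, timestamp (tab-separated).
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t3\t886397596",
]

def run_locally(lines):
    """Hypothetical helper: map every line, group by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper_get_ratings(None, None, line):
            grouped[key].append(value)
    output = {}
    for key in sorted(grouped):
        for out_key, total in reducer_count_ratings(None, key, grouped[key]):
            output[out_key] = total
    return output

counts = run_locally(SAMPLE_LINES)
print(counts)  # how many times each rating value occurs
```

Note that the keys are strings, because the mapper yields the rating exactly as it was split from the line.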

Installation & Preparation

  • After logging in to the command-line interface:


    image.png
  • Run locally:

python RatingsBreakdown.py u.data

  • Run with Hadoop:

python RatingsBreakdown.py -r hadoop --hadoop-streaming-jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar u.data

  • The result should be:


    image.png
  • In addition, code for a more complex problem (sorting by movie counts):


    image.png
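
The Hadoop run above passes a Streaming jar on the command line. Under Hadoop Streaming, the mapper and reducer are ordinary programs that exchange text lines: the mapper writes tab-separated key/value lines, and the reducer receives those lines sorted by key. The following is a rough sketch of that contract for the ratings count, assuming hypothetical function names and sample lines. It works on Python lists rather than standard input and output so that it can be run anywhere.

```python
from itertools import groupby

def streaming_mapper(lines):
    """Mapper side: one input line in, one 'key<TAB>value' line out."""
    for line in lines:
        user_id, movie_id, rating, timestamp = line.rstrip('\n').split('\t')
        yield f"{rating}\t1"

def streaming_reducer(sorted_lines):
    """Reducer side: lines arrive sorted by key, so equal keys are adjacent."""
    pairs = (line.split('\t') for line in sorted_lines)
    for rating, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{rating}\t{total}"

# Hypothetical sample lines in the tab-separated layout of the input file.
sample = [
    "1\t50\t4\t880000001\n",
    "2\t50\t5\t880000002\n",
    "3\t172\t4\t880000003\n",
    "4\t133\t1\t880000004\n",
]

mapped = list(streaming_mapper(sample))
# sorted() stands in for the shuffle-and-sort step that Hadoop performs.
reduced = list(streaming_reducer(sorted(mapped)))
print(reduced)
```

In a real Streaming job the two functions would read sys.stdin and print to standard output; a Python job library is expected to hide that plumbing.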