Hadoop - Pig

Ambari

  • Ambari provides a web dashboard for the cluster:

    [image: Ambari dashboard]
  • Enable the admin user by resetting its password:

ssh yourname@127.0.0.1 -p 2222
su root
ambari-admin-password-reset

Pig Concept

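[image: Pig concept diagram]
  • Ways to run Pig:
  1. Grunt (Pig's interactive shell)
  2. Script (a saved Pig Latin file)
  3. Ambari (the Pig View in the web UI)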

Example: find the oldest 5-star movie (here, a movie whose average rating is above 4.0)

  • Create a new script in the Pig View:

    [image: new script in Pig View]
  • Load data:

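-- u.data is tab-delimited, which is PigStorage's default delimiter.
ratings = LOAD 'ml-100k/u.data' AS (userID:int, movieID:int, rating:int, ratingTime:int);

-- u.item is pipe-delimited, so the delimiter is passed explicitly.
metadata = LOAD 'ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdbLink:chararray);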
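
To make the two file layouts concrete, here is a minimal Python sketch of what the two LOAD statements do to a single line each. The sample lines, the example.com link, and the function names are hypothetical and only follow the layout of the files; the sketch is not how Pig itself parses them.

```python
# Hypothetical sample lines in the layout of ml-100k/u.data and ml-100k/u.item.
RATING_LINE = "196\t242\t3\t881250949"
ITEM_LINE = "1|Toy Story (1995)|01-Jan-1995||http://example.com/toy-story|0|0|0"


def parse_rating(line):
    """Split a tab-delimited rating line into the four typed fields."""
    user_id, movie_id, rating, rating_time = line.split("\t")
    return {
        "userID": int(user_id),
        "movieID": int(movie_id),
        "rating": int(rating),
        "ratingTime": int(rating_time),
    }


def parse_item(line):
    """Split a pipe-delimited item line and keep only the first five fields."""
    movie_id, title, release_date, video_release, imdb_link = line.split("|")[:5]
    return {
        "movieID": int(movie_id),
        "movieTitle": title,
        "releaseDate": release_date,
        "videoRelease": video_release,
        "imdbLink": imdb_link,
    }


if __name__ == "__main__":
    print(parse_rating(RATING_LINE))
    print(parse_item(ITEM_LINE))
```

The item line carries more columns than the schema names, so the sketch simply ignores everything after the fifth field.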

  • FOREACH/GENERATE:

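-- Keep movieID so nameLookup can be joined later; u.item dates look like 01-Jan-1995, hence 'dd-MMM-yyyy'.
nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;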
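
The release dates in u.item are written as day, abbreviated month name, and year (for example 01-Jan-1995), so the pattern has to match a textual month. A minimal Python sketch of the same conversion, assuming UTC and an English locale (the function name is hypothetical, and Pig's own result depends on the server's time zone):

```python
from datetime import datetime, timezone


def to_unix_time(release_date):
    """Convert a date like '01-Jan-1995' to seconds since the Unix epoch (UTC assumed)."""
    # %b matches an abbreviated month name such as 'Jan' in an English locale.
    parsed = datetime.strptime(release_date, "%d-%b-%Y")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


if __name__ == "__main__":
    print(to_unix_time("01-Jan-1995"))
```

Turning the date into a plain number is what makes the later sort by release time straightforward.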

  • GROUP BY:

ratingsByMovie = GROUP ratings BY movieID;
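
Grouping produces one row per distinct movieID: a field named group holding the key, plus a bag named ratings holding every original row for that movie. A rough Python picture of that shape, using hypothetical sample tuples and a hypothetical helper name:

```python
from collections import defaultdict

# Hypothetical (userID, movieID, rating) rows.
ratings = [(1, 10, 5), (2, 10, 4), (3, 20, 2)]


def group_by_movie(rows):
    """Collect all rows sharing a movieID under that key, like a bag per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)
    return dict(groups)


if __name__ == "__main__":
    for movie_id, bag in group_by_movie(ratings).items():
        print(movie_id, bag)
```

Each value keeps the full rows, which is why the averaging step can still reach the rating field inside the bag.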

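  • Average, filter, join, and sort to get the result:

-- Average rating per movie; "group" holds the movieID key.
avgRatings = FOREACH ratingsByMovie GENERATE group AS movieID, AVG(ratings.rating) AS avgRating;
-- Keep only movies whose average rating is above 4.0.
fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;
-- Attach the title and release time.
fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;
-- Sort ascending by release time, so the oldest movie comes first.
oldestFiveStarMovie = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovie;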
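
The whole pipeline (average per movie, keep averages above 4.0, join with the titles, sort by release time) can be sketched in a few lines of Python. The sample data, titles, and function name below are hypothetical, and the output rows are simplified compared with what Pig's join returns.

```python
from collections import defaultdict

# Hypothetical (userID, movieID, rating) rows.
ratings = [(1, 10, 5), (2, 10, 4), (1, 20, 5), (2, 20, 5), (1, 30, 2)]
# Hypothetical movieID -> (movieTitle, releaseTime) lookup.
name_lookup = {10: ("Movie A", 500), 20: ("Movie B", 100), 30: ("Movie C", 50)}


def oldest_high_rated(ratings, name_lookup, threshold=4.0):
    """Return (movieID, avgRating, title, releaseTime) rows, oldest first."""
    totals = defaultdict(lambda: [0, 0])  # movieID -> [rating sum, rating count]
    for _user_id, movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    averages = {m: total / count for m, (total, count) in totals.items()}
    # Filter step: strictly greater than the threshold.
    high = {m: avg for m, avg in averages.items() if avg > threshold}
    # Inner join: movies missing from the lookup are dropped.
    joined = [(m, avg, *name_lookup[m]) for m, avg in high.items() if m in name_lookup]
    # Order step: ascending release time puts the oldest movie first.
    return sorted(joined, key=lambda row: row[3])


if __name__ == "__main__":
    for row in oldest_high_rated(ratings, name_lookup):
        print(row)
```

In this sample, Movie C is the oldest title but is filtered out by its low average, so Movie B ends up first.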

  • Result (runtime: several minutes):

    [image: job result]
  • With Tez, the runtime shrinks to about 1 minute:

    [image: job result with Tez]