
CSE 482: Big Data Analysis (Spring 2020) Homework 2

Due date: Monday, February 19, 2020

Please make sure you submit a PDF version of your homework via D2L.

1. Write the corresponding HDFS commands to perform the tasks described for each question below. Type hadoop fs -help for the list of HDFS commands available. You can also refer to the documentation available at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html. To double-check your answers, you should test the commands to make sure they work correctly.

(a) Suppose you are connected to a master node on AWS running Hadoop on a Linux operating system. Assume you have created a data directory named logs (on the Linux filesystem of the master node), which currently contains 1000 Web log files to be processed. Write the Hadoop DFS commands needed to upload all the Web log files from the logs directory to the directory named /user/hadoop/data on HDFS. Assume the /user/hadoop/data directory does not yet exist on HDFS. Therefore, you need to create the directory first before transferring the files.

(b) Write the HDFS command to move the Web log files from the /user/hadoop/data directory on HDFS to a shared directory named /user/share/ on HDFS. After the move, all the files should be located in the /user/share/data/ directory. Write the HDFS command to list all the files and subdirectories located in the /user/share directory. To make sure the files have been moved, write the corresponding HDFS command to list all the files and subdirectories located in the directory named /user/hadoop to verify that the data subdirectory no longer exists.

(c) Suppose one of the files located in the /user/share/data/ directory, named 2020-01-01.txt, is corrupted. You need to replace the corrupted file with a new file named 2020-01-01-new.txt, which is currently located in the logs/new directory on the local (Linux) filesystem of the AWS master node.
Write the HDFS commands to (1) delete the corrupted file from the /user/share/data/ directory on HDFS, (2) upload the new file from the logs/new directory to the /user/share/data/ directory on HDFS, and (3) rename the new file on HDFS from 2020-01-01-new.txt to 2020-01-01.txt.

(d) Write the HDFS command to display the content of the file 2020-01-01.txt, which is currently stored in the /user/share/data/ directory on HDFS. As the file is huge, write another HDFS command to display the last kilobyte of the file to standard output.

2. Consider a Hadoop program written to solve each computational problem and dataset described below. State how you would set up the (key, value) pairs as inputs and outputs of its mapper and reducer classes. Assume your Hadoop program uses TextInputFormat as its input format (where each record corresponds to a line of the input file). Since the inputs for the mappers are the same (byte offset, content of the line) for all the problems below, you only have to specify the mappers' outputs as well as the reducers' inputs and outputs. You must also explain the operations performed by the map and reduce functions of the Hadoop program. If the problem requires more than one MapReduce job, you should explain what each job is trying to do along with its input and output key-value pairs. You should solve the computational problem with the minimum number of MapReduce jobs.

Example:
Data set: Collections of text documents.
Problem: Count the frequency of nouns that appear at least 100 times in the documents.
Answer:
(i) Mapper function: Tokenize each line into a set of terms (words), and filter out terms that are not nouns.
(ii) Mapper output: key is a noun, value is 1.
(iii) Reducer input: key is a noun, value is a list of 1's.
(iv) Reduce function: sums up the 1's for each key (noun).
(v) Reducer output: key is a noun, value is the frequency of the noun (filter out the nouns whose frequencies are below 100).

(a) Data set: Car for sale data.
Each line in the data file has 5 columns (seller id, car make, car model, car year, price). For example:
1234,honda,accord,2010,10500
2331,ford,taurus,2005,2400
Problem: Find the median price (over all years) for each make and model of vehicle. For example, the median price for ford taurus could be 8000.

(b) Data set: Netflix movie rental data. Each record in the data file contains the following 4 columns: userID, rental date, movie title, movie genre. For example, the records
user111 12-20-2019 star_wars scifi
user111 12-21-2019 aladdin animation
user111 12-25-2019 lion_king animation
Problem: Find the favorite movie genre of each user. In the above example, the favorite genre for user111 is animation.

(c) Data set: Youtube subscriber data. Each line in the data file is a 2-tuple (user, subscriber). For example, the following lines in the data file:
john mary
john bob
mary john
show that mary and bob are subscribers of John's Youtube videos.
Problem: Find all pairs of users who subscribe to each other's videos. In the example above, john and mary are such a pair of subscribers, but john and bob are not (since john does not subscribe to bob's videos).

(d) Data set: Loan applicant data. Each line in the data file contains the following attributes: marital status, age group, employment status, home ownership, credit rating, and class (approve/reject).
single, 18-25, employed, none, poor, reject.
single, 25-45, employed, yes, good, approve.
Problem: Compute the entropy of each attribute (marital status, age group, etc.) with respect to the class variable.

(e) Data set: Document data. Each record in the dataset corresponds to a document with its ID and the set of words that appear in the document. For example, the following records contain the set of words that appear in documents 12345, 12346, and 12347, respectively.
12345 team won goal result
12346 political party won election result
12347 lunch party restaurant
Problem: Compute the cosine similarity between every pair of documents in the dataset.
Given a pair of documents, say, u and v, their cosine similarity is computed as follows:

cosine(u, v) = n_uv / √(n_u × n_v),

where n_uv is the number of words that appear in both u and v, n_u is the number of words that appear in document u, and n_v is the number of words that appear in document v. For the above example, cosine(12345,12346) = 2/√20 whereas cosine(12346,12347) = 1/√15.
Hint: You will need two MapReduce (Hadoop) jobs for this problem.

3. Download the data file Titanic.csv from the class Web site. Each line in the data file has the following comma-separated attribute values:
PassengerGroup,Age,Gender,Outcome
For this question, you need to write a Hadoop program that computes the mutual information between every pair of attributes. The reducer output will contain the following key-value pair:
• key is the name of the attribute pair, e.g., (Age, Outcome).
• value is their mutual information.
Deliverable: Your Hadoop source code (*.java), the archived (jar) files, and the reducer output file, which must have 2 tab-separated columns: attribute pair and its mutual information value.

Source: http://www.3zuoye.com/contents/9/4806.html
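
As a sanity check on the cosine similarity formula in question 2(e), the following minimal Python sketch recomputes the values for the three example documents on a single machine. It is not a MapReduce solution and is not part of the assignment; the names docs, cosine, and pairwise_cosine are hypothetical and chosen only for this illustration.

```python
import math
from itertools import combinations

# Word sets copied from the example records in question 2(e).
docs = {
    "12345": {"team", "won", "goal", "result"},
    "12346": {"political", "party", "won", "election", "result"},
    "12347": {"lunch", "party", "restaurant"},
}


def cosine(u_words, v_words):
    """Return n_uv / sqrt(n_u * n_v) for two sets of words."""
    n_uv = len(u_words & v_words)  # words that appear in both documents
    return n_uv / math.sqrt(len(u_words) * len(v_words))


def pairwise_cosine(documents):
    """Compute the cosine similarity for every unordered pair of document IDs."""
    return {
        (u, v): cosine(documents[u], documents[v])
        for u, v in combinations(sorted(documents), 2)
    }


similarities = pairwise_cosine(docs)
for pair, value in similarities.items():
    print(pair, round(value, 4))
```

The pair (12345, 12346) shares the words won and result, so its value equals 2/√20, and (12346, 12347) shares only party, giving 1/√15, which agrees with the worked values in the problem statement.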
