COMP9313_WEEK1_1_Course Introduction

An introduction to the course COMP9313 (Big Data Management)

Lecturer: Dr. Xin Cao (曹欣)

Email: please search for it yourself

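Background: bachelor's and master's degrees from the College of Computer Science, Zhejiang University; Ph.D. from the School of Computer Science, Nanyang Technological University.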

Number of papers: 22

Research interests: Data Management (in particular, on geo-textual data), Databases, Information Retrieval, and Data Mining

Current research:

1） Filtering geo-textual data stream, e.g., geo-tagged tweets (SIGMOD13, ICDE15)

2） Keyword-aware route planning (PVLDB12, IJCAI15)

3） Efficient processing of spatial keyword queries (PVLDB10, SIGMOD11, PVLDB14, SIGMOD15, TODS15, PVLDB16, and an invited paper in ER12)

4） Mining significant semantic locations from user generated GPS data (PVLDB10)

5） Link structure analysis (PVLDB10, SIGMOD17)

Past research:

1）Using categorization information to improve question search in community based question answering services (CIKM09, WWW10, TOIS12)

2）Indoor distance-aware query processing (ICDE12)

3）Streaming graph clustering (ICDE16)

Tutor's Email: please search for it yourself

Course aims:

This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems.

This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies, and will prepare students to be able to build such systems as well as use them efficiently and effectively to address challenges in big data management.

Lectures:

Lectures focus on frontier technologies in big data management and their typical applications

Lectures will try to run in a more interactive mode and provide more examples

A few lectures may run in a more practical manner (e.g., like a lab/demo) to cover the applied aspects

Lecture length varies slightly depending on the progress (of that lecture)

Textbooks:

1）Hadoop: The Definitive Guide. Tom White. 4th Edition - O'Reilly Media

2）Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2nd edition - Cambridge University Press

3）Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. University of Maryland, College Park.

4）Learning Spark. Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. O'Reilly Media

Reference material:

1）Apache MapReduce Tutorial

2）Apache Spark Quick Start

Topics covered:

1）Topic 1. Big data management tools

Apache Hadoop

MapReduce

YARN/HDFS/HBase/Hive/Pig (briefly introduced)

Spark

AWS platform

Mahout [tentative]

2）Topic 2. Big data typical applications

Finding similar items

Graph data processing

Data stream mining

Recommender Systems

Prerequisites:

1）have experience in and good knowledge of algorithm design (equivalent to COMP9024)

2）have a solid background in database systems (equivalent to COMP9311)

3）have solid programming skills in Java

4）be familiar with working on Unix-style operating systems

5）have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics, and graph theory

Expected learning outcomes:

1）elaborate the important characteristics of Big Data

2）develop an appropriate storage structure for a Big Data repository

3）utilize the map/reduce paradigm to manipulate Big Data

4）utilize the Spark platform to manipulate Big Data

5）develop efficient solutions for analytical problems involving Big Data
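
To make outcome 3 a little more concrete, here is a minimal sketch of the map/reduce paradigm as a word count in plain Python. It is only an illustration of the idea (map emits key/value pairs, a shuffle groups them by key, reduce aggregates each group); it is not the Hadoop API, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this example. The course projects themselves use Hadoop MapReduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the list of counts for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "Big Data management"])` counts each word across both lines. In a real cluster the map and reduce steps run in parallel on different machines and the framework performs the shuffle; here everything runs in one process purely to show the data flow.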

Assignments and assessment:

4 projects:

1 warm-up programming project on Hadoop MapReduce

1 harder project on Hadoop MapReduce

1 project on Spark

1 project on AWS (MapReduce/Spark)

Because the CSE computers run Linux:

Use Linux/command line (virtual machine image will be provided)

Projects marked on Linux servers

You need to be able to upload, run, and test your program under Linux

Submission:

Use Give to submit (either command line or web page)

Classrun. Check your submission, marks, etc. Read https://wiki.cse.unsw.edu.au/give/Classrun

(Note: late submissions incur a 10% penalty on the first day and a 30% penalty after that)

Final Exam：

1）Double pass: final exam mark >= 40%

2）Final written exam (100 pts)

Course schedule:

Schedule

Laboratories (11 in total):

5 labs on MapReduce; 3 labs on Spark; 1 lab on high-level MapReduce tools; 1 lab on AWS; 1 lab on a big data machine learning platform [tentative]

Environment setup (installed as a virtual machine):

1）Pure Xubuntu 14.04: http://www.cse.unsw.edu.au/~z3515164/Raw_Xubuntu.zip

2）Xubuntu 14.04 with pre-installed Hadoop and Eclipse plugin: http://mirror.cse.unsw.edu.au/pub/cs9313/Xubuntu.zip

Installation steps:

（1）Download the zip file, uncompress it, and rename the file "xubuntu-disk.vmdk" to "xubuntu-disk2.vmdk"

（2）Open VirtualBox, File->Import Appliance

（3）Browse the image folder, select the "*.ovf" file

（4）The image will be imported to your computer, which may take 10 minutes

（5）comp9313 is used as both username and password. The hadoop installation path is the same as in the virtual machine on lab computers.
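
Step (1) above can also be scripted. The following is a minimal sketch in Python, assuming the archive has already been downloaded and that it contains exactly one file named "xubuntu-disk.vmdk" somewhere inside it. The helper name `unpack_and_rename` and the paths in the usage note are hypothetical, chosen only for illustration; the import in VirtualBox (steps 2 to 4) is still done by hand.

```python
import zipfile
from pathlib import Path


def unpack_and_rename(zip_path, dest_dir,
                      old_name="xubuntu-disk.vmdk",
                      new_name="xubuntu-disk2.vmdk"):
    """Extract the VM archive and rename the disk image as step (1) asks.

    Returns the path of the renamed disk file.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # creates dest if it does not exist yet

    # Assumption: the disk image appears exactly once in the extracted tree.
    matches = list(dest.rglob(old_name))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one {old_name!r} under {dest}, found {len(matches)}"
        )

    target = matches[0].with_name(new_name)
    matches[0].rename(target)
    return target
```

For example, `unpack_and_rename("Xubuntu.zip", "vm")` would extract the archive into a folder named `vm` and return the path of the renamed disk; the "*.ovf" file in the same folder is then the one to select in step (3).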