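COMP9313 (Big Data Management): Course Introduction
Lecturer: Dr. Cao Xin (曹欣)
Email: please look it up yourself
Background: bachelor's and master's degrees from the computer science school of Zhejiang University; PhD from the computer science school of Nanyang Technological University.
Number of papers: 22
Research interests: Data Management (in particular, on geo-textual data), Databases, Information Retrieval, and Data Mining
Current research: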
1) Filtering geo-textual data stream, e.g., geo-tagged tweets (SIGMOD13, ICDE15)
2) Keyword-aware route planning (PVLDB12, IJCAI15)
3) Efficient processing of spatial keyword queries (PVLDB10, SIGMOD11, PVLDB14, SIGMOD15, TODS15, PVLDB16, and an invited paper in ER12)
4) Mining significant semantic locations from user generated GPS data (PVLDB10)
5) Link structure analysis (PVLDB10, SIGMOD17)
Past research:
1) Using categorization information to improve question search in community-based question answering services (CIKM09, WWW10, TOIS12)
2) Indoor distance-aware query processing (ICDE12)
3) Streaming graph clustering (ICDE16)
Tutor's email: please look it up yourself
Course aims:
This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems.
This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies, and will prepare students to be able to build such systems as well as use them efficiently and effectively to address challenges in big data management.
Lectures:
Lectures focus on frontier technologies for big data management and their typical applications
Lectures will try to run in a more interactive mode and provide more examples
A few lectures may run in a more practical manner (e.g., like a lab/demo) to cover the applied aspects
Lecture length varies slightly depending on the progress (of that lecture)
Textbooks:
1) Hadoop: The Definitive Guide. Tom White. 4th edition - O'Reilly Media
2) Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2nd edition - Cambridge University Press
3) Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. University of Maryland, College Park.
4) Learning Spark. Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. O'Reilly Media
Reference materials:
1)Apache MapReduce Tutorial
2)Apache Spark Quick Start
Topics covered:
1) Topic 1. Big data management tools
Apache Hadoop
MapReduce
YARN/HDFS/HBase/Hive/Pig (briefly introduced)
Spark
AWS platform
Mahout [tentative]
2) Topic 2. Typical big data applications
Finding similar items
Graph data processing
Data stream mining
Recommender Systems
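As a taste of the "Finding similar items" topic, here is a minimal sketch of one common starting point: representing each document as a set of word shingles and comparing the sets with Jaccard similarity. This is an illustration only, not taken from the course materials; the function names `shingles` and `jaccard` and the shingle size `k=2` are assumptions chosen for the example.

```python
def shingles(text, k=2):
    """Return the set of k-word shingles (consecutive word tuples) of a text."""
    words = text.lower().split()
    # A text with fewer than k words yields an empty set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; taken as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = shingles("big data management course")
    doc2 = shingles("big data management tools")
    # The two documents share 2 of 4 distinct shingles.
    print(jaccard(doc1, doc2))
```

Comparing every pair of documents this way does not scale to large collections, which is why techniques that avoid the all-pairs comparison are usually studied alongside it.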
Prerequisites:
1) have experience in and good knowledge of algorithm design (equivalent to COMP9024)
2) have a solid background in database systems (equivalent to COMP9311)
3) have solid programming skills in Java
4) be familiar with working on a Unix-style operating system
5) have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics, and graph theory
Expected learning outcomes:
1) elaborate the important characteristics of Big Data
2) develop an appropriate storage structure for a Big Data repository
3) utilize the map/reduce paradigm to manipulate Big Data
4) utilize the Spark platform to manipulate Big Data
5) develop efficient solutions for analytical problems involving Big Data
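To make the map/reduce paradigm in the outcomes above concrete, here is a minimal single-machine sketch of word count written in plain Python. It only imitates the three stages (map, shuffle, reduce) in memory; it does not use Hadoop, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical names chosen for this example. The course projects themselves use Hadoop MapReduce.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would do between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data streams"]))
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network; the sketch keeps only the data flow.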
Assignments and assessment:
4 projects:
1 warm-up programming project on Hadoop MapReduce
1 harder project on Hadoop MapReduce
1 project on Spark
1 project on AWS (MapReduce/Spark)
Because the CSE computers run Linux:
Use Linux/command line (virtual machine image will be provided)
Projects marked on Linux servers
You need to be able to upload, run, and test your program under Linux
Submitting assignments:
Use Give to submit (either command line or web page)
Classrun. Check your submission, marks, etc. Read https://wiki.cse.unsw.edu.au/give/Classrun
(Note: late submissions incur a 10% penalty for the first day and a 30% penalty after that.)
Final Exam:
1) Double Pass, final >= 40%
2) Final written exam (100 pts)
Course schedule:
Laboratories (11 in total):
5 labs on MapReduce; 3 labs on Spark; 1 lab on high-level MapReduce tools; 1 lab on AWS; 1 lab on a big data machine learning platform [tentative]
Setting up the environment (installed as a virtual machine):
1) Pure Xubuntu 14.04: http://www.cse.unsw.edu.au/~z3515164/Raw_Xubuntu.zip
2) Xubuntu 14.04 with pre-installed Hadoop and Eclipse plugin: http://mirror.cse.unsw.edu.au/pub/cs9313/Xubuntu.zip
Installation steps:
(1) Download the zip file, uncompress it, and rename the file "xubuntu-disk.vmdk" to "xubuntu-disk2.vmdk"
(2) Open VirtualBox, File->Import Appliance
(3) Browse to the image folder and select the "*.ovf" file
(4) The image will be imported to your computer, which may take 10 minutes
(5) comp9313 is used as both username and password. The Hadoop installation path is the same as in the virtual machine on lab computers.