COMP9313 (Big Data Management): Course Introduction
Lecturer: Dr. Xin Cao (曹欣)
Research interests: Data Management (in particular, geo-textual data), Databases, Information Retrieval, and Data Mining
1) Filtering geo-textual data stream, e.g., geo-tagged tweets (SIGMOD13, ICDE15)
2) Keyword-aware route planning (PVLDB12, IJCAI15)
3) Efficient processing of spatial keyword queries (PVLDB10, SIGMOD11, PVLDB14, SIGMOD15, TODS15, PVLDB16, and an invited paper in ER12)
4) Mining significant semantic locations from user generated GPS data (PVLDB10)
5) Link structure analysis (PVLDB10, SIGMOD17)
1)Using categorization information to improve question search in community based question answering services (CIKM09, WWW10, TOIS12)
2)Indoor distance-aware query processing (ICDE12)
3)Streaming graph clustering (ICDE16)
Tutor's email: please look it up yourself
This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems.
This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies, and will prepare students to be able to build such systems as well as use them efficiently and effectively to address challenges in big data management.
Lectures focus on frontier technologies in big data management and their typical applications
Lectures will try to run in a more interactive mode and provide more examples
A few lectures may run in a more practical manner (e.g., like a lab/demo) to cover the applied aspects
Lecture length varies slightly depending on the progress of that lecture
1)Hadoop: The Definitive Guide. Tom White. 4th Edition - O'Reilly Media
2)Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2nd edition - Cambridge University Press
3)Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. University of Maryland, College Park.
4)Learning Spark. Matei Zaharia, Holden Karau, Andy Konwinski, Patrick Wendell. O'Reilly Media
1)Apache MapReduce Tutorial
2)Apache Spark Quick Start
1)Topic 1. Big data management tools
Apache Hadoop
YARN/HDFS/HBase/Hive/Pig (briefly introduced)
AWS platform
Mahout [tentative]
2)Topic 2. Big data typical applications
Finding similar items
Graph data processing
Data stream mining
Recommender Systems
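As a small illustration of the "Finding similar items" topic, the sketch below computes the Jaccard similarity between the word sets of two short texts. It is only an assumed example of a set-based similarity measure, not material taken from the course; the class name `JaccardSketch` and its helper methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration: Jaccard similarity between two sets of words.
final class JaccardSketch {

    // Split a text into a set of lower-case words.
    static Set<String> tokens(String text) {
        Set<String> words = new HashSet<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Jaccard similarity = |A intersect B| / |A union B|.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention chosen for this sketch: two empty sets are identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // The two texts share "big" and "data" out of four distinct words.
        System.out.println(jaccard(tokens("big data management"), tokens("big data analytics")));
    }
}
```

On large collections, comparing every pair this way is too slow, which is why this topic usually moves on to approximate techniques.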
1)have experience in and good knowledge of algorithm design (equivalent to COMP9024)
2)have a solid background in database systems (equivalent to COMP9311)
3)have solid programming skills in Java
4)be familiar with working on a Unix-style operating system
5)have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics, and graph theory
1)elaborate the important characteristics of Big Data
2)develop an appropriate storage structure for a Big Data repository
3)utilize the map/reduce paradigm and the Hadoop platform to manipulate Big Data
4)utilize the Spark platform to manipulate Big Data
5)develop efficient solutions for analytical problems involving Big Data
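To give a feel for the map/reduce paradigm named in these outcomes, here is a minimal word-count sketch in plain Java. It deliberately does not use the Hadoop API: the class `WordCountSketch` and its `map`, `shuffle`, and `reduce` methods are hypothetical names that only mimic the three phases in memory on a single machine.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map/reduce phases (no Hadoop involved).
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return groups;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map over every line, shuffle, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("big data management", "big data analytics")));
    }
}
```

In a real framework the map and reduce calls run in parallel on different machines and the shuffle moves data across the network; the sketch keeps only the data flow.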
1 warm-up programming project on Hadoop MapReduce
1 harder project on Hadoop MapReduce
1 project on Spark
1 project on AWS (MapReduce/Spark)
Use Linux/command line (virtual machine image will be provided)
Projects marked on Linux servers
You need to be able to upload, run, and test your program under Linux
Use Give to submit (either command line or web page)
Classrun. Check your submission, marks, etc. Read
(Note: late submissions incur a 10% penalty for the first day and a 30% penalty after that.)
Final Exam:
1)Double Pass, final >= 40%
2)Final written exam (100 pts)
5 labs on MapReduce; 3 labs on Spark; 1 lab on high-level MapReduce tools; 1 lab on AWS; 1 lab on a big data machine learning platform [tentative]
1)Pure Xubuntu 14.04
2)Xubuntu 14.04 with pre-installed Hadoop and Eclipse plugin
(1)Download the zip file, uncompress it, and rename the file "xubuntu-disk.vmdk" to "xubuntu-disk2.vmdk"
(2)Open VirtualBox, File->Import Appliance
(3)Browse the image folder, select the "*.ovf" file
(4)The image will be imported to your computer, which may take about 10 minutes
(5)comp9313 is used as both the username and the password. The Hadoop installation path is the same as in the virtual machine on the lab computers.