SEDS 542
Large-Scale Data Management
This course introduces the fundamental concepts and computational paradigms of large-scale data management, including the major methods for storing, updating, and querying large datasets and for data-intensive computing. It covers concepts, algorithms, and system issues in parallel and distributed databases, peer-to-peer data management, MapReduce and its ecosystem, Spark and dataflows, data lakes, and NoSQL databases.
Course Objectives
To introduce students to the current trends in large-scale data management covering concepts, architectures, algorithms and system issues.
Recommended or Required Reading
T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Springer, 4th ed., 2020.
H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete Book. Prentice Hall, 2nd ed., 2008.
L. Wiese. Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases. De Gruyter, 2015.
Learning Outcomes
1. Learn state-of-the-art research and industry trends in large-scale data management systems.
2. Understand the fundamental principles that govern all modern DBMSs.
3. Be able to make design decisions when deploying large-scale data processing applications and to identify the bottlenecks of such applications.
4. Learn how to install and use open-source systems and libraries to perform meaningful large-scale data management tasks.
| Topics |
| Distributed Database Design |
| Distributed Query Processing |
| Distributed Transaction Processing |
| Parallel Architectures and Data Placement |
| Parallel Query Processing |
| Infrastructure and Schema Mapping |
| Querying and Replica Consistency |
| Blockchain |
| Distributed Storage Systems |
| MapReduce and its Ecosystem |
| Spark, Dataflows, and Data Lakes |
| Key-Value Stores and Document Stores |
| Wide-Column Stores and Graph DBMSs |
| Hybrid Data Stores and Polystores |
Grading
Midterm 20%
Homework 20%
Attendance 20%
Final 40%

