SEDS 542

Large-Scale Data Management

This course introduces the fundamental concepts and computational paradigms of large-scale data management, including the major methods for storing, updating, and querying large datasets and for data-intensive computing. It covers concepts, algorithms, and system issues in parallel and distributed databases, peer-to-peer data management, MapReduce and its ecosystem, Spark and dataflows, data lakes, and NoSQL databases.

Course Objectives

To introduce students to current trends in large-scale data management, covering concepts, architectures, algorithms, and system issues.

Recommended or Required Reading

M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Springer, 4th ed., 2020

H. Garcia-Molina, J. D. Ullman, J. Widom. Database Systems: The Complete Book. Prentice Hall, 2nd ed., 2008

L. Wiese. Advanced Data Management: For SQL, NoSQL, Cloud and Distributed Databases. De Gruyter, 2015

Learning Outcomes

1. Learn state-of-the-art research and industry trends in large-scale data management systems,

2. Understand the fundamental principles that govern all modern DBMSs,

3. Be able to make design decisions when deploying large-scale data processing applications and to identify the bottlenecks of such applications,

4. Learn how to install and use open source systems and libraries in order to perform meaningful large-scale data management tasks.

Topics
Distributed Database Design
Distributed Query Processing
Distributed Transaction Processing
Parallel Architectures and Data Placement
Parallel Query Processing
Infrastructure and Schema Mapping
Querying and Replica Consistency
Blockchain
Distributed Storage Systems
MapReduce and its Ecosystem
Spark, Dataflows, and Data Lakes
Key-Value Stores and Document Stores
Wide-Column Stores and Graph DBMSs
Hybrid Data Stores and Polystores

Grading

Midterm 20%

Homework 20%

Attendance 20%

Final 40%