CENG 544

Large-Scale Data Management

This course introduces the fundamental concepts and computational paradigms of large-scale data management, including the major methods for storing, updating, and querying large datasets and for data-intensive computing. It covers concepts, algorithms, and system issues in parallel and distributed databases, peer-to-peer data management, MapReduce and its ecosystem, Spark and dataflows, data lakes, and NoSQL databases.

Course Objectives

To introduce students to current trends in large-scale data management, covering concepts, architectures, algorithms, and system issues.

Recommended or Required Reading

M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Springer, 4th ed., 2020.

M. Kleppmann. Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media, Inc., 2017.

Learning Outcomes

  1. To understand current research and technological trends in large-scale data management
  2. To comprehend the fundamental principles of modern database management systems
  3. To identify bottlenecks in large-scale data management applications and make appropriate design decisions
  4. To install and utilize open-source software systems and libraries required for meaningful data management operations

Week Topics
1 Distributed Database Systems: Distributed Database Design

2 Distributed Query Processing
3 Distributed Transaction Processing
4 Parallel Database Systems: Parallel Architectures and Data Placement

5 Parallel Query Processing
6 Peer-to-Peer Data Management: Infrastructure and Schema Mapping

7 Querying and Replica Consistency
8 Blockchain
9 Big Data Processing: Distributed Storage Systems

10 MapReduce and its Ecosystem
11 Spark, Dataflows, and Data Lakes
12 NoSQL, NewSQL and Polystores: Key-Value Stores and Document Stores

13 Wide-Column Stores and Graph DBMSs
14 Hybrid Data Stores and Polystores

Grading

Midterm: 30%

Research Presentation: 30%

Final: 40%