1. Types of Data

Classification

Data can be broadly classified into four types:

  1. Structured Data:

    • Has a predefined data model (schema).
    • Organized in a way that’s easy to store, process, retrieve, and manage.
    • Example: Relational data in RDBMS tables.
  2. Unstructured Data:

    • Opposite of structured data; lacks a predefined model.
    • Examples: Flat binary files like text documents, video, audio.
    • Note: Not completely devoid of structure. An audio file has encoding structure and metadata.
  3. Dynamic Data:

    • Data that changes relatively frequently.
    • Examples: Office documents being edited, transactional entries in financial databases.
  4. Static Data:

    • Opposite of dynamic data; changes infrequently.
    • Examples: Medical imaging data (MRI/CT scans), archived documents.

Why Classify Data?

Segmenting data helps in designing appropriate storage solutions. A common way to visualize this is a 2x2 quadrant:

|              | Dynamic                                   | Static                                    |
|--------------|-------------------------------------------|-------------------------------------------|
| **Unstructured** | Media Production, eCAD, mCAD, Office Docs | Media Archive, Broadcast, Medical Imaging |
| **Structured**   | Transaction Systems, ERP, CRM             | BI, Data Warehousing                      |
  • Relational Databases (RDBMS): Typically used for Structured Data (often Dynamic, like transactions, but also Static, like data warehouses).
  • File Systems / NoSQL Databases: Often used for Unstructured Data (especially Static, like archives), but NoSQL can handle various types. Exam Importance: Medium - Know which system type fits which data type generally.

2. Scaling Databases & the 2PC Protocol

Scaling Traditional Databases (RDBMS)

Scaling refers to increasing the capacity of a database system.

  1. Vertical Scaling (Scaling Up):

    • Achieved by hardware upgrades on a single machine (faster CPU, more RAM, larger disk).
    • Limitation: Physical limits of how much hardware can be put into one machine.
  2. Horizontal Scaling (Scaling Out):

    • Achieved by adding more machines to a cluster.
    • Requires techniques like:
      • Sharding: Splitting data across multiple machines. **Exam Importance: High** - Understand the concept.
      • Replication: Copying data across multiple machines. **Exam Importance: High** - Understand the concept and benefits.
    • Limitation: Read-to-Write ratio impacts performance; communication overhead between machines increases.

Why Shard Data?

  • Sharding (or Striping): Dividing a large dataset (like a file or database table) into smaller pieces called chunks or shards and distributing them across multiple machines.
  • Benefit: Allows for concurrent/parallel access to different parts of the data, improving performance.
    • Example: If a large file is split into 5 chunks across 3 machines (Machine 1: Chunks 1, 2; Machine 2: Chunks 3, 4; Machine 3: Chunk 5), then Chunks 1, 3, and 5 can potentially be accessed in parallel.
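A minimal sketch of this idea in Python (a made-up helper, not from the slides): split a byte string into fixed-size chunks and assign contiguous groups of chunks to machines, reproducing the 5-chunk / 3-machine layout above.

```python
# Hypothetical helper (illustrative only): split data into fixed-size chunks
# and assign contiguous groups of chunks to machines, as in the example above
# (Machine 1: chunks 1-2, Machine 2: chunks 3-4, Machine 3: chunk 5).
def shard(data: bytes, chunk_size: int, num_machines: int) -> dict[int, list[bytes]]:
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    per_machine = -(-len(chunks) // num_machines)          # ceiling division
    placement = {m: [] for m in range(1, num_machines + 1)}
    for idx, chunk in enumerate(chunks):
        placement[idx // per_machine + 1].append(chunk)
    return placement

layout = shard(b"x" * 50, chunk_size=10, num_machines=3)   # 5 chunks, 3 machines
for machine, chunks in layout.items():
    print(f"Machine {machine}: {len(chunks)} chunk(s)")
```

Chunks placed on different machines (here chunks 1, 3, and 5) can then be read in parallel.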

Amdahl’s Law (Speedup Limits)

Exam Importance: High - Understand the formula and its implication

  • Question: How much faster does a parallel program run?
  • Let:
    • T_1 = time for sequential execution (on one machine).
    • T_N = time for parallel execution on N processors/machines.
    • f = fraction of the program that is not parallelizable (sequential part).
    • (1 - f) = fraction of the program that is parallelizable.
  • Amdahl’s Formula for Speedup: Speedup = T_1 / T_N = 1 / (f + (1 - f)/N).
  • Example: If 80% of a program can be parallelized (1 - f = 0.8, so f = 0.2) and you use 4 machines (N = 4), the maximum speedup is 1 / (0.2 + 0.8/4) = 2.5.
  • Implication: Even with infinite processors, the speedup is limited by the sequential portion to at most 1/f (here 1/0.2 = 5).
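A tiny Python check of the formula and its limit, using the numbers from the example above:

```python
# Amdahl's Law: speedup with sequential fraction f on n machines.
def amdahl_speedup(f: float, n: int) -> float:
    return 1.0 / (f + (1.0 - f) / n)

print(amdahl_speedup(0.2, 4))        # 2.5  -> 80% parallelizable, 4 machines
print(amdahl_speedup(0.2, 10_000))   # ~5.0 -> approaching the 1/f limit
```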

Real vs. Ideal Parallel Speedup

  • Amdahl’s Law is simplified. Real-world parallel programs face:
    • Communication Overhead: Time spent coordinating and exchanging data between machines/processes.
    • Workload Imbalance: Some machines/processes may finish their tasks earlier and wait for others.
  • These factors reduce the actual speedup compared to the ideal speedup predicted by Amdahl’s Law.

Guidelines for Effective Parallelization

  1. Maximize the parallelizable fraction (1 - f) of the program.
  2. Balance the workload evenly across parallel processes/machines.
  3. Minimize the time spent on communication.

Why Replicate Data?

  • Replication: Storing copies of the same data on multiple servers.
  • Benefits: Exam Importance: High - Know the benefits
    • Avoids performance bottlenecks (reads can be distributed).
    • Avoids single points of failure (if one server fails, data is available elsewhere).
    • Enhances Scalability and Availability.

The Consistency Challenge

  • Problem: When data is replicated, how do you ensure all copies are consistent, especially when updates occur concurrently?
  • Example: Two servers replicate a bank account with a balance of $1000. Event 1 (a deposit of $1000) is routed to Server 1, whose balance becomes $2000, while Event 2 (crediting interest) is routed to Server 2. The replicas now disagree, and because the two updates do not commute, the order in which each server eventually applies them changes the final balance (e.g., $2100 if the interest is applied after the deposit). Maintaining consistency under concurrent updates is hard.
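A quick sketch of why concurrent updates diverge (the deposit amount and interest rate are assumptions used only for illustration): two replicas apply the same two updates in opposite orders and end up with different balances.

```python
# Two non-commuting updates applied in different orders on two replicas.
def deposit(balance: float) -> float:
    return balance + 1000          # Event 1: deposit $1000 (assumed amount)

def add_interest(balance: float) -> float:
    return balance * 1.05          # Event 2: credit 5% interest (assumed rate)

start = 1000.0
replica_1 = add_interest(deposit(start))   # deposit, then interest -> 2100.0
replica_2 = deposit(add_interest(start))   # interest, then deposit -> 2050.0
print(replica_1, replica_2)                # replicas disagree until reconciled
```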

The Two-Phase Commit (2PC) Protocol

Exam Importance: Medium - Understand the purpose and basic phases

  • Purpose: A protocol used in distributed systems (like distributed RDBMS) to ensure atomicity and strict consistency for transactions spanning multiple nodes.
  • Roles: Coordinator (manages the transaction), Participants (nodes involved).
  • Phases (a minimal code sketch follows this list):
    1. Phase I: Voting (or Prepare)
      • Coordinator sends VOTE_REQUEST (or PREPARE) to all participants.
      • Participants check if they can commit (e.g., write changes to stable storage logs).
      • If a participant can commit, it sends VOTE_COMMIT (or PREPARED) back. If not, it sends VOTE_ABORT.
    2. Phase II: Commit (or Decision)
      • If Coordinator receives VOTE_COMMIT from all participants: It sends GLOBAL_COMMIT to all. Participants then make changes permanent (LOCAL_COMMIT).
      • If Coordinator receives any VOTE_ABORT or times out: It sends GLOBAL_ABORT to all. Participants roll back any changes.
  • Limitation: 2PC enforces Strict Consistency, but it involves multiple communication rounds and blocking (participants wait for the coordinator’s decision). This limits scalability and availability (if the coordinator fails, the system can get stuck).
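A minimal in-process sketch of the 2PC control flow described above (illustrative only: real implementations exchange messages over the network and must handle timeouts, write-ahead logging, and coordinator failure):

```python
# Toy 2PC: a coordinator collects votes, then broadcasts a global decision.
class Participant:
    def __init__(self, can_commit: bool):
        self.can_commit = can_commit
        self.state = "INIT"

    def vote(self) -> str:                      # Phase I: answer VOTE_REQUEST
        self.state = "READY" if self.can_commit else "ABORTED"
        return "VOTE_COMMIT" if self.can_commit else "VOTE_ABORT"

    def decide(self, decision: str) -> None:    # Phase II: apply global decision
        self.state = "COMMITTED" if decision == "GLOBAL_COMMIT" else "ABORTED"

def two_phase_commit(participants: list[Participant]) -> str:
    votes = [p.vote() for p in participants]                       # Phase I: voting
    decision = ("GLOBAL_COMMIT" if all(v == "VOTE_COMMIT" for v in votes)
                else "GLOBAL_ABORT")
    for p in participants:                                         # Phase II: decision
        p.decide(decision)
    return decision

print(two_phase_commit([Participant(True), Participant(True)]))    # GLOBAL_COMMIT
print(two_phase_commit([Participant(True), Participant(False)]))   # GLOBAL_ABORT
```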

3. The CAP Theorem and BASE Properties

The CAP Theorem

Exam Importance: Very High - Understand C, A, P and the theorem’s statement

  • Describes fundamental limitations of distributed systems with shared data.
  • Three desirable properties:
    1. C - Consistency (Strict/Linearizable): Every read receives the most recent write or an error. All nodes see the same data at the same time. (This is the strict definition used in CAP).
    2. A - Availability: Every (non-failing) node returns a response for any request (read or write) in a reasonable amount of time, without guaranteeing it contains the most recent write.
    3. P - Partition Tolerance: The system continues to operate correctly even if network partitions occur (i.e., messages are lost or delayed between nodes).
  • CAP Theorem Statement (Brewer’s Theorem): A distributed shared-data system can guarantee at most two out of the three properties (Consistency, Availability, Partition Tolerance).
  • Implication for real-world distributed systems: Network partitions (P) do happen. Therefore, systems must choose between prioritizing Consistency (C) or Availability (A) during a partition.

CAP Trade-offs

  • Scenario: Network partition separates nodes.
    • Choose C over A (CP System): To ensure consistency, the side of the partition that cannot guarantee the latest data might have to become unavailable (stop responding to requests) until the partition heals. Example: Some RDBMS configurations, systems like MongoDB, HBase, Redis favoring consistency.
    • Choose A over C (AP System): To remain available, nodes on both sides of the partition might respond to requests, potentially with stale data. Consistency is sacrificed (becomes “eventual”). Example: Systems like Cassandra, CouchDB, Riak, DynamoDB favoring availability.
    • Choose C and A (CA System): Only possible if you can guarantee no network partitions (P). This generally applies only to single-node systems or tightly coupled clusters, not wide-area distributed systems. Example: Traditional single-node RDBMS.

(Diagram on slide 22 illustrates this trade-off triangle)

Motivation for Large-Scale Databases

  • Companies like Google and Amazon needed databases that could scale horizontally to thousands of machines.
  • At that scale, node failures and network partitions are common, not exceptions.
  • 24/7 Availability was critical (downtime = lost revenue).
  • Conclusion: They needed systems prioritizing Availability (A) and Partition Tolerance (P). This meant sacrificing strict Consistency (C), as dictated by the CAP theorem.

Trading-Off Consistency

  • Consistency isn’t binary (strictly consistent vs. totally inconsistent). There’s a spectrum.
  • Strict Consistency: Harder to implement, often inefficient, limits scalability/availability (as seen with 2PC).
  • Loose/Weaker Consistency (e.g., Eventual Consistency): Easier to implement, more efficient, enables higher availability and scalability.
  • The “right” level of consistency depends on the application’s requirements. Can the application tolerate slightly stale data sometimes?

The BASE Properties

Exam Importance: High - Know the acronym and what each property means

  • An alternative set of guarantees favored by many NoSQL databases, especially AP systems, as a contrast to traditional ACID properties.
  • Stands for:
    • Basically Available: The system guarantees Availability (per CAP). It responds to requests, even if with stale data or an error.
    • Soft State: The state of the system may change over time, even without new input, due to the eventual consistency mechanisms working in the background. Data might be inconsistent temporarily.
    • Eventual Consistency: If no new updates are made to a given data item, eventually all replicas of that item will converge to the same value. Exam Importance: High - Understand this concept

Eventual Consistency

  • Definition: A guarantee that, given enough time without updates, all replicas will become consistent.
  • Challenge: What happens if a client reads from different replicas before they have converged? The client might see inconsistent states (e.g., read older data after just writing newer data).
    • Example: Client updates Webpage-A on Server 1. Immediately reads Webpage-A from Server 2, which hasn’t received the update yet. Client sees the old version.
  • Potential Solution: Protocols like Read Your Own Writes (RYOW) can be implemented to ensure a client always sees its own recent writes, even if others might see older versions temporarily.
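A toy illustration (not a real protocol implementation) of the stale-read problem and an RYOW-style fallback: the client's write reaches only one replica, a plain read from the other replica returns the old version, and an RYOW read insists on a replica that has at least the client's latest version.

```python
# Versioned replicas: replication from replica 0 to replica 1 is delayed.
class Replica:
    def __init__(self):
        self.value, self.version = "old page", 0

    def write(self, value: str, version: int) -> None:
        self.value, self.version = value, version

replicas = [Replica(), Replica()]

client_version = 1
replicas[0].write("new page", client_version)   # client updates Webpage-A on replica 0

print(replicas[1].value)                         # plain read from replica 1 -> "old page" (stale)

def ryow_read(replicas: list[Replica], min_version: int) -> str:
    # Read-your-own-writes: only accept a replica at least as new as the client's last write.
    for r in replicas:
        if r.version >= min_version:
            return r.value
    raise RuntimeError("no sufficiently fresh replica available")

print(ryow_read(replicas, client_version))       # -> "new page"
```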

4. NoSQL Databases (MongoDB & Cassandra)

Introduction to NoSQL

  • NoSQL: Often stands for “Not Only SQL”.
  • Emerged to address the limitations of traditional RDBMS for large-scale, distributed, often unstructured/semi-structured data scenarios.
  • Mainly follow BASE properties rather than strict ACID.
  • Key Characteristics: Exam Importance: High
    • No strict schema requirements (often “schema-on-read” or flexible schema).
    • No strict adherence to ACID properties (especially trading C for A).
    • Consistency is often traded for Availability and Partition Tolerance.
    • Designed for horizontal scalability.
  • Examples: Amazon DynamoDB, Google Bigtable, Cassandra, MongoDB, CouchDB.

Types of NoSQL Databases

Exam Importance: Medium - Know the main categories and examples

A common (limited) taxonomy:

  1. Document Stores: Store data in document formats (like JSON or BSON). Examples: MongoDB, CouchDB.
  2. Graph Databases: Store data as nodes and edges, optimized for relationship traversal. Examples: Neo4j, ArangoDB.
  3. Key-Value Stores: Simplest type; store data as a dictionary/hash map (key maps to a value). Examples: Redis, Riak, Memcached.
  4. Columnar (or Wide-Column) Databases: Store data in columns rather than rows, good for analytical queries over large datasets. Examples: Cassandra, HBase, Vertica.

(Slides 32, 38, 39, 40 show this taxonomy, focusing on one category at a time)

Document Stores (Focus: MongoDB)

  • Storage: Documents stored in standard formats like JSON (JavaScript Object Notation), BSON (Binary JSON - used by MongoDB, more compact and efficient), XML, etc. Often referred to as BLOBs (Binary Large Objects) if opaque.
  • Indexing: Documents can be indexed on various fields within them, allowing for efficient querying (outperforming simple file systems).
  • Querying: Often support rich query languages; some (like MongoDB, CouchDB) can be queried using MapReduce paradigms.

MongoDB Deep Dive

  • What it is: Cross-Platform, Open Source, Non-relational, Distributed, NoSQL, Document-Oriented data store.
  • Why MongoDB? Addresses challenges of RDBMS like handling large volumes, rich variety (unstructured/semi-structured data), and scaling needs. Designed to:
    • Scale horizontally (scale out).
    • Flexible schema.
    • Fault tolerant.
    • Partition tolerant.
    • Easily distributed.
  • Key Features (Slide 35):
    • Auto-sharding: Automatically partitions data across servers.
    • Doc Oriented: Stores data in BSON documents.
    • Full index support: Can index any field.
    • High Performance: Due to features like indexing, memory mapping.
    • Rich Query Language (QL): Supports powerful queries.
    • Fast in-place updates: Can modify parts of a document efficiently.
    • Easy scalability: Designed for horizontal scaling.
    • Replication: Built-in support for data redundancy and availability.
    • High availability: Through replication and failover mechanisms.
  • Key Concepts (Slide 36): Exam Importance: Medium (a short pymongo sketch of these concepts follows this list)
    • Unique Key (_id): Each document has a unique _id field, acting like a primary key. Uses BSON for efficient storage.
    • Database: A MongoDB server can host multiple databases.
    • Collection: A grouping of MongoDB documents (analogous to a table in RDBMS, but schema-less).
    • Document: A single record, stored in BSON format (analogous to a row, but with dynamic schema).
    • Dynamic Queries: Supports queries on dynamic data structures.
    • Storing Binary Data (GridFS): MongoDB uses GridFS to store files larger than the BSON document size limit (typically 16MB). It breaks a file into smaller chunks, stored in the fs.chunks collection, and keeps the file’s metadata in the fs.files collection.
  • MongoDB Replication (Slide 37): Typically uses a Primary-Secondary model.
    • Writes go to the Primary node.
    • The Primary replicates writes asynchronously to Secondary nodes.
    • Reads can often be directed to Secondaries to distribute load. Provides high availability.
  • MongoDB Sharding (Slide 37):
    • Splits a large Collection across multiple Shards (servers or replica sets).
    • Distributes data automatically based on a shard key.
    • Allows the database to scale horizontally beyond the capacity of a single server.
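A minimal pymongo sketch of the key concepts above (assumes a MongoDB server on localhost and the pymongo driver; the database, collection, and field names are made up for illustration):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]                   # Database: one server can host several
orders = db["orders"]                 # Collection: schema-less group of documents

# Document: a BSON record; MongoDB adds a unique _id if none is supplied.
result = orders.insert_one({"customer": "Alice", "total": 42.5, "items": ["pen", "ink"]})

orders.create_index([("customer", ASCENDING)])            # full index support
orders.update_one({"_id": result.inserted_id},            # fast in-place update
                  {"$set": {"total": 45.0}})
doc = orders.find_one({"customer": "Alice"})               # rich query language
print(doc["_id"], doc["total"])
```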

Graph Databases

(Mentioned in taxonomy, no details provided in these slides)

  • Focus on relationships between data points. Uses nodes, edges, and properties.

Key-Value Stores

(Mentioned in taxonomy, no details provided in these slides)

  • Data stored as key-value pairs. Very fast for simple lookups by key.
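A tiny sketch of the key-value model using the redis-py client (assumes a local Redis server; the key name is illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("user:42:name", "Alice")        # key -> value, no schema
print(r.get("user:42:name"))          # fast lookup by key: "Alice"
```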

Columnar Databases

Exam Importance: Medium - Understand the core difference from row-stores

  • A hybrid of RDBMS ideas and Key-Value stores.
  • Storage: Values are stored grouped by columns, not rows (see the short sketch after this list).
    • Row-Order: (Alice, 3, 25, Bob), (4, 19, Carol, 0)
    • Column-Order: (Alice, 4), (3, 19), (25, Carol), (Bob, 0) (grouped by column position/name).
  • Column Families: Often, columns are grouped into “column families” for locality (e.g., store columns B and C together). This optimizes retrieval if you often query those columns together.
  • Querying: Typically queried by key (row key), retrieving specific columns or column families for that key. Efficient for queries that only need a subset of columns across many rows.
  • Examples: HBase, Vertica, Cassandra (Cassandra is often classified here, though it has elements of other types too).
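A short Python sketch of the row-order vs. column-order layouts above (toy data, purely illustrative):

```python
# Row-order: each tuple is one row.
rows = [("Alice", 3, 25, "Bob"),
        (4, 19, "Carol", 0)]

# Column-order: group the values of each column across all rows.
columns = list(zip(*rows))
print(columns)               # [('Alice', 4), (3, 19), (25, 'Carol'), ('Bob', 0)]

# A query that needs only one column can scan just that group.
second_column = columns[1]   # (3, 19)
print(second_column)
```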

Cassandra

(Mentioned in outline and as an example, but no detailed slides provided)

  • Known for excellent Availability (AP system), horizontal scalability, and performance, particularly for write-heavy workloads. Often classified as a Columnar or Wide-Column store.

5. Summary

  • Data types: structured, unstructured, dynamic, static. Different types often need different database designs.
  • Databases scale up (vertically) or out (horizontally).
  • Sharding and Replication are key techniques for horizontal scaling.
  • Amdahl’s Law shows limits to parallel speedup due to sequential parts.
  • 2PC protocol ensures strict consistency in distributed systems but limits scalability/availability.
  • CAP Theorem: Choose at most two of Consistency, Availability, Partition Tolerance. Real systems often sacrifice C for A and P. Exam Importance: Very High
  • BASE Properties (Basically Available, Soft State, Eventual Consistency): An alternative model favoring availability, common in NoSQL. Exam Importance: High
  • Eventual Consistency: Data becomes consistent over time if updates stop.
  • NoSQL (Not-Only-SQL): Databases often relaxing schema and ACID for better scalability/availability. Characteristics: Flexible schema, BASE properties, horizontal scaling.
  • NoSQL Types: Document Stores (MongoDB), Graph Databases, Key-Value Stores (Redis), Columnar Databases (Cassandra, HBase). Exam Importance: Medium