CAP Theorem Explained (Brewer’s Theorem)
Introduction
- Context: Applies to distributed computing environments where interconnected nodes share data.
- Core Statement: It is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:
- Consistency (C)
- Availability (A)
- Partition Tolerance (P)
- The Trade-off: You must choose two guarantees, inevitably sacrificing the third one.
The Three Guarantees Defined
- Consistency:
- Every read operation receives the most recent write or an error.
- All nodes see the same data at the same time.
- Availability:
- Every request receives a (non-error) response.
- This response is not guaranteed to contain the most recent write.
- Each non-failing node will return a response in a reasonable amount of time.
- Partition Tolerance:
- The system continues to operate (respond to requests) even when there is a network partition (communication break) between nodes.
- Messages can be dropped or delayed between nodes without causing total system failure.
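The three guarantees can be made concrete with a toy two-replica store. This is an illustrative sketch (all names are invented, not from any real database): a flag simulates a partition, and the two read methods show the choice between erroring (Consistency) and answering with possibly stale data (Availability).

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoNodeStore:
    def __init__(self):
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False  # True => replication messages are dropped

    def write(self, value):
        # Writes land on replica A; replication to B succeeds only
        # when the network between the nodes is healthy.
        self.a.value = value
        if not self.partitioned:
            self.b.value = value

    def read_consistent(self):
        # C-style read from B: refuse to answer if B might be stale.
        if self.partitioned:
            raise RuntimeError("partition: cannot guarantee consistency")
        return self.b.value

    def read_available(self):
        # A-style read from B: always answer, possibly with stale data.
        return self.b.value

store = TwoNodeStore()
store.write("v1")
store.partitioned = True   # network break between the nodes
store.write("v2")          # only replica A sees this update

print(store.read_available())   # "v1" — a response, but stale
```

Calling `read_consistent()` at the same point raises an error instead: same partition, opposite trade-off.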
Real-life Example: Training Institute “XYZ”
- Initial Setup: Single coordinator (Amey) managing schedules for 50 instructors. Central schedule source. This system appears Consistent and Available, but relies on a single point/process, implicitly lacking Partition Tolerance if Amey or the central system fails.
- Scaling Problem: Growth leads to more instructors, clients, and a second coordinator (Joey).
- Inconsistency Issue: Amey updates the schedule, but Joey doesn’t see the update immediately (and vice-versa). A client calling Joey might get different information than if they called Amey. This demonstrates a lack of Consistency when updates aren’t shared instantaneously.
- Availability Issue: If one coordinator is unavailable (e.g., on leave), can the other still function and provide schedules? The plan to inform the other person via email ensures some level of operation continues, touching upon Availability concerns during failures (like a partition).
- Partition Tolerance Implication: When Amey and Joey operate somewhat independently, possibly holding inconsistent data until they “patch up”, the system is effectively trying to be Partition Tolerant: it keeps working despite communication delays or breaks between them. This typically comes at the cost of immediate Consistency, or of Availability of the latest confirmed schedule. The text concludes that one scenario yields a system that is Partition Tolerant but not Available for updates at that specific time.
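The Amey/Joey scenario can be sketched in a few lines. This is a hypothetical model (coordinator and method names are invented): each coordinator answers from a local copy of the schedule, so answers stay available but may disagree until the two “patch up”.

```python
class Coordinator:
    def __init__(self, name):
        self.name = name
        self.schedule = {}  # instructor -> time slot

    def update(self, instructor, slot):
        self.schedule[instructor] = slot

    def ask(self, instructor):
        # Always answers from the local copy (available, possibly stale).
        return self.schedule.get(instructor, "unknown")

def patch_up(c1, c2):
    # Naive merge: union the two copies. A real system would need
    # conflict resolution (e.g. timestamps) if both edited one entry.
    merged = {**c1.schedule, **c2.schedule}
    c1.schedule = dict(merged)
    c2.schedule = dict(merged)

amey, joey = Coordinator("Amey"), Coordinator("Joey")
amey.update("instructor_7", "Mon 10:00")

print(joey.ask("instructor_7"))  # "unknown" — Joey hasn't heard yet
patch_up(amey, joey)
print(joey.ask("instructor_7"))  # "Mon 10:00" — consistent after syncing
</imports>```

Before the patch-up, a client gets different answers depending on whom they call — exactly the inconsistency described above.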
Making the Choice: Pick Two (Combinations)
In the real world, network partitions happen. Therefore, Partition Tolerance (P) is often considered mandatory for truly distributed systems. The main trade-off becomes between Consistency (C) and Availability (A).
- AP (Availability + Partition Tolerance):
- Sacrifices Consistency.
- The system stays available under partitions, but may return stale data.
- Emphasis on uptime and responsiveness.
- Examples: Riak, Cassandra, CouchDB, Dynamo-like systems.
- CP (Consistency + Partition Tolerance):
- Sacrifices Availability.
- During a partition, the system may become unavailable (return errors or timeouts) to ensure that inconsistent data is not returned.
- Emphasis on data correctness.
- Examples: HBase, MongoDB, Redis, MemcacheDB, BigTable-like systems.
- CA (Consistency + Availability):
- Sacrifices Partition Tolerance.
- Assumes the network is reliable and partitions won’t occur. If a partition does happen, the system can no longer uphold its C and A guarantees, and its behavior is undefined.
- Often applies to traditional, single-site RDBMS or tightly coupled clusters.
- Examples: Traditional RDBMS like PostgreSQL, MySQL (in standard single-node or tightly clustered setups).
(Refer to Figure 3.15 for a visual representation and database examples)
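A common mechanism behind CP behavior is a majority quorum, as used in Dynamo-style systems. The sketch below is an assumption-laden toy (N = 3 replicas, W = 2, R = 2, so R + W > N guarantees that read and write quorums overlap): operations that cannot reach a quorum refuse to proceed, trading Availability for Consistency during a partition.

```python
N, W, R = 3, 2, 2  # R + W > N: every read quorum overlaps every write quorum

replicas = [{"version": 0, "value": None} for _ in range(N)]

def quorum_write(reachable, version, value):
    # Succeed only if a write quorum of replicas is reachable.
    if len(reachable) < W:
        raise TimeoutError("write quorum not met: unavailable (CP choice)")
    for i in reachable:
        replicas[i] = {"version": version, "value": value}

def quorum_read(reachable):
    if len(reachable) < R:
        raise TimeoutError("read quorum not met: unavailable (CP choice)")
    # Quorum overlap guarantees at least one contacted replica is current,
    # so returning the highest-versioned answer is consistent.
    return max((replicas[i] for i in reachable),
               key=lambda r: r["version"])["value"]

quorum_write([0, 1, 2], 1, "v1")   # healthy network: all replicas updated
quorum_write([0, 1], 2, "v2")      # partition: replica 2 is unreachable
print(quorum_read([1, 2]))         # "v2" — the overlap finds the latest write
```

A client stranded with only replica 2 would get a `TimeoutError` from either operation: the minority side of the partition goes unavailable rather than serve stale data.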
BASE: Basically Available, Soft State, Eventual Consistency
- Often associated with AP systems.
- An alternative design philosophy that prioritizes Availability over strong Consistency.
Key BASE Concepts
- Where is it used?
- Distributed computing.
- Why is it used?
- To achieve High Availability.
- How is it achieved (Eventual Consistency)?
- Given a data item: if no new updates are made to it for a period, eventually all accesses to that item will return the last updated value.
- Data is replicated across nodes, and this replication happens over time (“eventually”), not instantaneously.
- What is replica convergence?
- The state where a system has achieved eventual consistency; all replicas of a data item hold the same (latest) value.
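Eventual consistency and replica convergence can be sketched with a last-write-wins gossip loop. This is a toy model under stated assumptions: timestamps are globally comparable (real systems often use vector clocks instead), and a few pairwise gossip rounds stand in for background replication.

```python
import itertools

class EcReplica:
    def __init__(self):
        self.ts, self.value = 0, None

    def local_write(self, ts, value):
        # Last-write-wins: keep the update with the newer timestamp.
        if ts > self.ts:
            self.ts, self.value = ts, value

    def gossip_with(self, other):
        # Exchange state; both sides keep whichever entry is newer.
        newest = max((self.ts, self.value), (other.ts, other.value))
        self.ts, self.value = newest
        other.ts, other.value = newest

nodes = [EcReplica() for _ in range(3)]
nodes[0].local_write(1, "old")
nodes[2].local_write(2, "new")     # a later, conflicting update elsewhere

# Updates have stopped; gossip rounds now drive the replicas together.
for a, b in itertools.combinations(nodes, 2):
    a.gossip_with(b)

print({n.value for n in nodes})    # {'new'} — replica convergence reached
```

Between the writes and the gossip rounds, different nodes would have answered differently; once gossip completes, every replica holds the latest value.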
- How are conflicts resolved (Conflict Resolution)?
- Needed because nodes might receive conflicting updates during the period before convergence.
- Methods:
- (a) Read Repair: Discrepancy detected during a read operation triggers a correction. Slows down the read.
- (b) Write Repair: Discrepancy detected during a write operation triggers a correction. Slows down the write.
- (c) Asynchronous Repair: A background process periodically checks for and corrects discrepancies, not tied directly to read/write operations.
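Method (a) can be sketched concretely. This is an illustrative read-repair sketch (the function and data shapes are invented for the example): a read consults several replicas, returns the newest version, and writes that version back to any stale replica it noticed — which is why the repair slows down the read.

```python
def read_with_repair(replicas):
    # Each replica is a dict like {"version": int, "value": str}.
    newest = max(replicas, key=lambda r: r["version"])
    # Piggyback the repair on the read: refresh any stale replica seen.
    for r in replicas:
        if r["version"] < newest["version"]:
            r["version"], r["value"] = newest["version"], newest["value"]
    return newest["value"]

replicas = [
    {"version": 2, "value": "fresh"},
    {"version": 1, "value": "stale"},   # missed the latest update
]
print(read_with_repair(replicas))       # "fresh"
print(replicas[1]["value"])             # also "fresh" now — repaired
```

Write repair works the same way but hooks the check into the write path; asynchronous repair runs the same reconciliation in a background process instead.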