19.6 Availability
- Core Goal: High availability - The database system should continue functioning (or “remain available”) even in the presence of failures. This is crucial for distributed systems, where failures are more likely.
- Robustness: The ability to continue operation even with different types of failures.
Key Concepts
-
Failure Detection: The system must be able to detect that a failure has occurred. This can be done through:
- Monitoring message transmissions and acknowledgments. Repeated failures to receive acknowledgments suggest a link or site failure.
- Timeouts: If a response is not received within a certain time, assume failure.
-
Reconfiguration: After a failure is detected, the system must reconfigure itself to:
- Continue processing transactions (possibly with reduced performance).
- Isolate the failed component(s).
-
Recovery: When a failed component (site or link) is repaired, the system must integrate it back into the active system.
-
Distinguishing Failures is Hard: It’s often difficult to distinguish between:
- Site failure.
- Link failure.
- Network partition.
This ambiguity makes designing robust systems challenging. A site might appear to be down when it’s actually just isolated due to a network partition.
Types of Failures (and basic handling)
- Message Loss: Handled by retransmission (using protocols like TCP/IP).
- Link Failure:
- The network may try to find an alternative route.
- If no alternative route exists, it may indicate a network partition.
- Site Failure: Detected when other sites cannot communicate with it.
- Network Partition: The network is split into two or more partitions that cannot communicate with each other. A partition can consist of a single node.
Reconfiguration Procedures
When a failure is detected, a reconfiguration procedure must be initiated.
-
Aborting Transactions:
- Transactions active at a failed/inaccessible site at the time of failure should be aborted.
- This is important because these transactions might hold locks at active sites, blocking other transactions. Waiting for the failed site to recover could cause significant delays.
- With data replication, reads and updates are possible even if some replicas are inaccessible.
-
Updating the Catalog:
- If replicated data were stored at a failed/inaccessible site, the catalog (system metadata) must be updated.
- This prevents queries from trying to access the unavailable copy.
- When a failed site recovers, the catalog must be updated again, and the data at the site brought up-to-date.
-
Electing a New Server (if necessary):
- If a central server (e.g., name server, concurrency coordinator, deadlock detector) fails, an election must be held to choose a replacement.
-
Consistency Constraints (during reconfiguration): Reconfiguration schemes must be designed to work correctly even in the case of network partitions. Specifically, they must avoid these situations:
- Two or more central servers being elected in different partitions.
- Multiple partitions independently updating a replicated data item.
19.6.1 Majority-Based Approach
- Extends: The majority protocol for concurrency control (Section 19.5.1.4).
- Version numbers: Each data object stores a version number.
-
Mechanism:
- Locking: To update an object, a transaction must obtain a lock on a majority of the replicas.
- Reading: Read operations look at all locked replicas and use the value from the replica with the highest version number. (Optionally, they can update older replicas).
- Writing:
- Read all replicas (like a read operation) to determine the highest version number.
- The new version number is one greater than the highest version number.
- Write to all locked replicas (which must be a majority) and update their version numbers.
-
Failure Tolerance: Transactions can proceed as long as:
- A majority of replicas of all written objects are available at commit time.
- A majority of replicas are read to find the highest version number during reads.
- The two-phase commit protocol can proceed using only available sites.
-
Reintegration: Trivial (no special action needed) because:
- Writes already updated a majority.
- Reads will find at least one up-to-date replica (due to the majority requirement).
19.6.2 Read One, Write All Available Approach
- Based on: A special case of quorum consensus (biased protocol).
- Give all sites unit weights.
- Read quorum = 1.
- Write quorum = n (all sites).
- In this case, we do not need to use version numbers.
-
Mechanism:
- Read: Read any available replica (and obtain a read lock there).
- Write: Attempt to write to all replicas (and obtain write locks). If a site is down, proceed without it.
-
Advantages: Simple, allows reads even if some replicas are unavailable.
-
Disadvantages:
- Writes can be blocked if any replica is unavailable (in the basic “read one, write all” scheme).
- The “read one, write all available” variant can lead to inconsistencies if there are network partitions. Different partitions could update the same data item.
- Reintegration is complex: After a failure, a recovering site needs to determine which updates it missed. This is non-trivial.
19.6.3 Site Reintegration
-
Challenge: When a failed site or link recovers, it must be integrated back into the system.
-
Requirements:
- Update system tables to reflect the rejoined site/link.
- If the site had replicas, bring its data up-to-date. This is not simple because updates might have occurred while the site was down.
-
Naive Solution (Undesirable): Halt the entire system while the site rejoins. This is usually unacceptable due to disruption.
-
Better Solution: Allow concurrent updates while the site reintegrates.
- The site must catch up on missed updates before granting any new read/write locks.
- If a link recovers, all sites should be notified promptly (because partitioning restricts operations).
19.6.4 Comparison with Remote Backup
-
Remote Backup (Section 16.9): Data and log records are replicated at a backup site. Concurrency control and recovery are handled at the primary site.
- Advantages: Avoids two-phase commit, lower overhead, transactions only contact one site.
- Disadvantages: Less availability than full replication.
-
Replication (in this chapter): Data is replicated at multiple sites.
- Advantages: Higher availability (can use majority protocols).
- Disadvantages: More overhead (two-phase commit, communication with multiple sites).
19.6.5 Coordinator Selection
-
Problem: If a coordinator fails, a new one must be chosen.
-
Approaches:
-
Backup Coordinator: A designated backup site maintains enough state to take over immediately if the primary coordinator fails.
- Advantages: Fast recovery.
- Disadvantages: Overhead of maintaining the backup state.
-
Election Algorithms: If there’s no backup (or the backup also fails), an election algorithm is used to choose a new coordinator dynamically.
- Requirements:
- Must choose a unique coordinator.
- Must work even with multiple failures.
- Example: Bully Algorithm.
- Requirements:
-
Bully Algorithm (Simplified):
- Each site has a unique ID.
- The coordinator is always the active site with the highest ID.
- If a site suspects the coordinator has failed, it sends “election” messages to sites with higher IDs.
- If no response, the site declares itself the coordinator.
- If a site with a higher ID responds, the election is deferred.
- A recovered site “bullies” lower-ID sites to take over as coordinator.
- Should also verify the chosen coordinator can access the majority of sites, in order to avoid multiple coordinators.
-