19.6 Availability

  • Core Goal: High availability - The database system should continue functioning (or “remain available”) even in the presence of failures. This is especially important in distributed systems, where the larger number of components makes individual failures more likely.
  • Robustness: The ability to continue operation despite various types of failures.

Key Concepts

  • Failure Detection: The system must be able to detect that a failure has occurred. This can be done through:

    • Monitoring message transmissions and acknowledgments. Repeated failures to receive acknowledgments suggest a link or site failure.
    • Timeouts: If a response is not received within a certain time, assume failure.
  • Reconfiguration: After a failure is detected, the system must reconfigure itself to:

    • Continue processing transactions (possibly with reduced performance).
    • Isolate the failed component(s).
  • Recovery: When a failed component (site or link) is repaired, the system must integrate it back into the active system.

  • Distinguishing Failures is Hard: It’s often difficult to distinguish between:

    • Site failure.
    • Link failure.
    • Network partition.

    This ambiguity makes designing robust systems challenging. A site might appear to be down when it’s actually just isolated due to a network partition.
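The timeout-based detection described above can be sketched as follows. This is an illustrative sketch only; the names `TIMEOUT`, `suspected_failures`, and `last_heard` are assumptions, not part of any real system.

```python
import time

# Hypothetical timeout-based failure detector: each site records the time of
# the last message (or heartbeat acknowledgment) received from every peer,
# and suspects any peer that has been silent longer than TIMEOUT seconds.
TIMEOUT = 5.0  # assumed threshold; real systems tune this to network latency

def suspected_failures(last_heard, now=None):
    """Return the set of site IDs that have exceeded the timeout.

    last_heard maps site ID -> timestamp of the most recent message.
    Note: a suspected site may merely be partitioned away, not down,
    which is exactly the ambiguity discussed above.
    """
    now = time.time() if now is None else now
    return {site for site, t in last_heard.items() if now - t > TIMEOUT}
```

A site flagged by such a detector should be treated as *inaccessible* rather than definitively failed, since a network partition produces the same symptoms.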

Types of Failures (and basic handling)

  • Message Loss: Handled by retransmission (e.g., by transport protocols such as TCP).
  • Link Failure:
    • The network may try to find an alternative route.
    • If no alternative route exists, it may indicate a network partition.
  • Site Failure: Detected when other sites cannot communicate with it.
  • Network Partition: The network is split into two or more partitions that cannot communicate with each other. A partition can consist of a single node.

Reconfiguration Procedures

When a failure is detected, a reconfiguration procedure must be initiated.

  • Aborting Transactions:

    • Transactions active at a failed/inaccessible site at the time of failure should be aborted.
    • This is important because these transactions might hold locks at active sites, blocking other transactions. Waiting for the failed site to recover could cause significant delays.
    • With data replication, reads and updates are possible even if some replicas are inaccessible.
  • Updating the Catalog:

    • If replicated data were stored at a failed/inaccessible site, the catalog (system metadata) must be updated.
    • This prevents queries from trying to access the unavailable copy.
    • When a failed site recovers, the catalog must be updated again, and the data at the site brought up-to-date.
  • Electing a New Server (if necessary):

    • If a central server (e.g., name server, concurrency coordinator, deadlock detector) fails, an election must be held to choose a replacement.
  • Consistency Constraints (during reconfiguration): Reconfiguration schemes must be designed to work correctly even in the case of network partitions. Specifically, they must avoid these situations:

    • Two or more central servers being elected in different partitions.
    • Multiple partitions independently updating a replicated data item.

19.6.1 Majority-Based Approach

  • Extends: The majority protocol for concurrency control (Section 19.5.1.4).
  • Version numbers: Each data object stores a version number.
  • Mechanism:

    • Locking: To read or write an object, a transaction must obtain locks on a majority of its replicas.
    • Reading: Read operations examine all replicas on which a lock was obtained and use the value from the replica with the highest version number. (Optionally, they may also write that value back to replicas with older versions.)
    • Writing:
      1. Read all replicas (like a read operation) to determine the highest version number.
      2. The new version number is one greater than the highest version number.
      3. Write to all locked replicas (which must be a majority) and update their version numbers.
  • Failure Tolerance: Transactions can proceed as long as:

    • A majority of replicas of all written objects are available at commit time.
    • A majority of replicas of every object read are available, so that the highest version number can be found.
    • The two-phase commit protocol can proceed using only available sites.
  • Reintegration: Trivial (no special action needed) because:

    • Writes already updated a majority.
    • Reads will find at least one up-to-date replica (due to the majority requirement).
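The majority-based read and write steps above can be sketched as follows. This is a minimal model, assuming each replica is a `(version, value)` pair keyed by site; the names `majority_read`, `majority_write`, and `QuorumError` are illustrative.

```python
# Sketch of the majority-based protocol with version numbers. Replica state
# and site availability are modeled as plain dicts/sets; locking and 2PC are
# abstracted away.

class QuorumError(Exception):
    pass

def majority_read(replicas, available):
    """Lock a majority of the replicas; return (version, value) of the newest.

    replicas maps site -> (version, value); available is the set of
    reachable sites. Fails if a majority cannot be locked.
    """
    reachable = {s: v for s, v in replicas.items() if s in available}
    if len(reachable) * 2 <= len(replicas):
        raise QuorumError("majority of replicas not available")
    # The replica with the highest version number holds the current value.
    return max(reachable.values())

def majority_write(replicas, available, new_value):
    """Write new_value with version = (highest version seen) + 1.

    First reads a majority to find the highest version number, then writes
    the new value and version to every locked (available) replica.
    """
    version, _ = majority_read(replicas, available)
    for site in replicas:
        if site in available:
            replicas[site] = (version + 1, new_value)
    return version + 1
```

Because every write reaches a majority, any two majorities intersect, so a later read is guaranteed to see at least one replica carrying the latest version number; this intersection argument is also why reintegration needs no special catch-up step.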

19.6.2 Read One, Write All Available Approach

  • Based on: A special case of quorum consensus (biased protocol).
  • All sites are given unit weight.
  • Read quorum = 1.
  • Write quorum = n (all sites).
  • Version numbers are not needed in this case.
  • Mechanism:

    • Read: Read any available replica (and obtain a read lock there).
    • Write: Attempt to write to all replicas (and obtain write locks). If a site is down, proceed without it.
  • Advantages: Simple, allows reads even if some replicas are unavailable.

  • Disadvantages:

    • Writes can be blocked if any replica is unavailable (in the basic “read one, write all” scheme).
    • The “read one, write all available” variant can lead to inconsistencies if there are network partitions. Different partitions could update the same data item.
    • Reintegration is complex: After a failure, a recovering site needs to determine which updates it missed. This is non-trivial.
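The read-one/write-all-available mechanism can be sketched as follows; the function names and the dict-based replica model are illustrative assumptions.

```python
# Sketch of "read one, write all available": reads use any single replica,
# writes go to every replica that is currently reachable. No version numbers.

def rowa_read(replicas, available):
    """Read from any one available replica."""
    for site, value in replicas.items():
        if site in available:
            return value
    raise RuntimeError("no replica available")

def rowa_write(replicas, available, new_value):
    """Write to all available replicas; skip unreachable ones.

    Caution: under a network partition, each partition updates only its own
    subset of replicas, producing the inconsistency noted above. A skipped
    replica silently becomes stale, which is why reintegration is hard.
    """
    written = []
    for site in replicas:
        if site in available:
            replicas[site] = new_value
            written.append(site)
    return written
```

Note how a replica at a down site is simply skipped: after recovery, a read served by that replica would return a stale value unless reintegration first brings it up to date.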

19.6.3 Site Reintegration

  • Challenge: When a failed site or link recovers, it must be integrated back into the system.

  • Requirements:

    • Update system tables to reflect the rejoined site/link.
    • If the site had replicas, bring its data up-to-date. This is not simple because updates might have occurred while the site was down.
  • Naive Solution (Undesirable): Halt the entire system while the site rejoins. This is usually unacceptable due to disruption.

  • Better Solution: Allow concurrent updates while the site reintegrates.

    • The site must catch up on missed updates before granting any new read/write locks.
    • If a link recovers, all sites should be notified promptly (because partitioning restricts operations).
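The catch-up step above can be sketched with a simple log-replay model. This is a hypothetical simplification: it assumes updates are identified by a log sequence number (LSN) obtainable from an up-to-date peer, and the names `reintegrate`, `local_state`, and `peer_log` are illustrative.

```python
# Hypothetical sketch of site reintegration: before granting any new locks,
# the recovering site replays the updates it missed while it was down.

def reintegrate(local_state, local_lsn, peer_log):
    """Apply all log records the site missed; return the new LSN.

    peer_log is a list of (lsn, key, value) records from an up-to-date peer.
    Only after this completes may the site grant read/write locks again.
    """
    for lsn, key, value in peer_log:
        if lsn > local_lsn:          # skip updates already applied locally
            local_state[key] = value
            local_lsn = lsn
    return local_lsn
```

In practice the site must also coordinate with concurrent updates arriving during the replay, which is what makes reintegration without halting the system non-trivial.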

19.6.4 Comparison with Remote Backup

  • Remote Backup (Section 16.9): Data and log records are replicated at a backup site. Concurrency control and recovery are handled at the primary site.

    • Advantages: Avoids two-phase commit, lower overhead, transactions only contact one site.
    • Disadvantages: Less availability than full replication.
  • Replication (in this chapter): Data is replicated at multiple sites.

    • Advantages: Higher availability (can use majority protocols).
    • Disadvantages: More overhead (two-phase commit, communication with multiple sites).

19.6.5 Coordinator Selection

  • Problem: If a coordinator fails, a new one must be chosen.

  • Approaches:

    • Backup Coordinator: A designated backup site maintains enough state to take over immediately if the primary coordinator fails.

      • Advantages: Fast recovery.
      • Disadvantages: Overhead of maintaining the backup state.
    • Election Algorithms: If there’s no backup (or the backup also fails), an election algorithm is used to choose a new coordinator dynamically.

      • Requirements:
        • Must choose a unique coordinator.
        • Must work even with multiple failures.
      • Example: Bully Algorithm.
    • Bully Algorithm (Simplified):

      1. Each site has a unique ID.
      2. The coordinator is always the active site with the highest ID.
      3. If a site suspects the coordinator has failed, it sends “election” messages to sites with higher IDs.
      5. If no higher-ID site responds, the site declares itself the coordinator.
      6. If a site with a higher ID responds, the initiator defers and waits for that site to announce itself as coordinator.
      7. When a site with a higher ID than the current coordinator recovers, it “bullies” the lower-ID coordinator and takes over.
      8. The chosen coordinator should also verify that it can reach a majority of sites, to avoid multiple coordinators arising in different partitions.
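The outcome of the simplified bully election above can be sketched as follows. This toy model abstracts messaging away entirely; `bully_elect` and the `alive` set are illustrative assumptions, and the real algorithm involves recursive elections and timeouts.

```python
# Toy sketch of the bully algorithm over a known set of site IDs. "alive"
# stands in for "responds to an election message within the timeout".

def bully_elect(site_id, all_sites, alive):
    """Run an election initiated by site_id; return the new coordinator.

    The initiator contacts every higher-ID site. If none responds, it
    declares itself coordinator. If any responds, the election cascades
    upward, so the highest-ID live site ends up winning.
    """
    higher = [s for s in all_sites if s > site_id and s in alive]
    if not higher:
        return site_id      # no higher site answered: declare self
    return max(higher)      # highest-ID live site becomes coordinator
```

The `max(higher)` shortcut relies on the bully invariant that higher-ID sites win their own elections; a full implementation would exchange election, answer, and coordinator messages to reach the same result.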