Question (5 Marks)
Explain five key features or mechanisms of Apache Cassandra that contribute to its characteristics of High Availability and Scalability. For each feature/mechanism, briefly describe what it is and how it supports either availability or scalability (or both).
Answer:
Based on the provided text, five key features/mechanisms contributing to Cassandra’s High Availability and Scalability are:
- Peer-to-Peer (Masterless/Decentralized) Architecture: Cassandra employs a masterless architecture in which all nodes are equal and play the same role (homogeneous). There is no single master node coordinating the cluster.
  - Contribution: This eliminates a single point of failure. If any node fails, the rest of the cluster can continue operating, directly supporting High Availability. It also distributes coordination load, aiding Scalability. (1 Mark)
- Data Replication (Replication Factor - RF): Cassandra stores multiple copies (replicas) of each piece of data across different nodes in the cluster, determined by the Replication Factor (e.g., RF=3 means 3 copies).
  - Contribution: If a node holding a replica fails, the data is still accessible from other replicas on different nodes, ensuring data redundancy and fault tolerance, which is crucial for High Availability. (1 Mark)
- Elastic Scalability (Horizontal Scaling): Cassandra is designed to scale out (horizontally) by adding more commodity (standard, inexpensive) servers/nodes to the cluster. Data and load are automatically distributed across the new nodes.
  - Contribution: This allows the cluster to handle massive datasets and high throughput simply by adding more machines without downtime, directly providing Scalability. (1 Mark)
- Gossip Protocol: This is a peer-to-peer communication protocol that nodes use to discover each other, exchange state information (such as load and up/down status), and detect failures efficiently.
  - Contribution: Efficient failure detection allows the cluster to quickly identify and route around unavailable nodes, maintaining High Availability. It also helps manage cluster membership smoothly as nodes are added or removed, supporting Scalability. (1 Mark)
- Hinted Handoffs: When a write request is intended for a replica node that is temporarily unavailable, the coordinator node can store the write locally as a “hint” and deliver it later when the target node recovers.
  - Contribution: This mechanism improves write Availability during temporary node outages, as writes can still be accepted (depending on consistency level) even if not all replicas are immediately reachable. (1 Mark)
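The hinted handoff mechanism described above can be illustrated with a minimal Python sketch. This is a toy in-memory model, not Cassandra's implementation: the `Replica` and `Coordinator` classes, the `required_acks` parameter (standing in for the write consistency level), and the key and node names are all hypothetical. It assumes that only live replicas count toward the required acknowledgements, while writes for unreachable replicas are kept as hints and replayed once the target is back.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical stand-in for one replica node."""
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)


@dataclass
class Coordinator:
    """Hypothetical coordinator that keeps hints for unreachable replicas."""
    hints: list = field(default_factory=list)  # entries: (replica name, key, value)

    def write(self, replicas, key, value, required_acks):
        acks = 0
        for replica in replicas:
            if replica.up:
                replica.data[key] = value
                acks += 1
            else:
                # Target is down: remember the write instead of dropping it.
                self.hints.append((replica.name, key, value))
        # Assumption: only live replicas count toward the consistency level.
        return acks >= required_acks

    def replay_hints(self, replicas):
        by_name = {r.name: r for r in replicas}
        remaining = []
        for name, key, value in self.hints:
            target = by_name.get(name)
            if target is not None and target.up:
                target.data[key] = value  # deliver the stored hint
            else:
                remaining.append((name, key, value))  # still down, keep the hint
        self.hints = remaining


# RF = 3, one replica temporarily down, two acknowledgements required.
replicas = [Replica("n1"), Replica("n2"), Replica("n3", up=False)]
coordinator = Coordinator()
accepted = coordinator.write(replicas, "user:42", "alice", required_acks=2)
pending_hints = len(coordinator.hints)

replicas[2].up = True  # n3 recovers
coordinator.replay_hints(replicas)
```

In this sketch the write is accepted while `n3` is down because two of the three replicas acknowledge it, and `n3` receives the missed value only when the hint is replayed.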
Question (5 Marks)
According to the provided text:
a) Explain the roles of the Partitioner and the Replication Factor (RF) in how Cassandra distributes and stores data across the cluster. (2 marks)
b) Describe the purpose of Anti-Entropy and Read Repair processes in maintaining data consistency among replicas. (2 marks)
c) What is the fundamental trade-off managed by Cassandra’s Tunable Consistency feature? (1 mark)
Answer:
a) Partitioner and Replication Factor (RF):
- The Partitioner uses a hash function on a row’s partition key to compute a token. This token determines which node in the cluster is responsible for storing the first replica of that row, thus dictating the initial data placement and distribution across nodes. (1 mark)
- The Replication Factor (RF) defines how many total copies (replicas) of each row should be stored across different nodes in the cluster (e.g., RF=3 means 3 copies). This provides data redundancy and fault tolerance. The placement of subsequent replicas (after the first one determined by the partitioner) depends on the chosen Replication Strategy. (1 mark)
b) Anti-Entropy and Read Repair:
- Anti-Entropy is a background process that proactively compares data replicas across nodes (using mechanisms like Merkle Trees, though not detailed in the text) and repairs any detected inconsistencies, ensuring replicas eventually converge to the same state. (1 mark)
- Read Repair is triggered during a read request. If the coordinator node queries multiple replicas (as required by the read consistency level) and detects inconsistencies among them, it sends updates to the nodes holding the outdated data after returning the most recent version to the client. This passively helps repair inconsistencies found during reads. (1 mark)
c) Tunable Consistency Trade-off:
- Cassandra’s Tunable Consistency feature allows developers to manage the trade-off between consistency (ranging from strong to eventual), availability, and latency for each read or write operation. Choosing a higher consistency level generally increases latency and may reduce availability (if nodes are down), while lower consistency offers higher availability and lower latency but risks reading stale data. (1 mark)
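Parts (a) and (c) can be made concrete with a short Python sketch. It is illustrative only and rests on several assumptions that the text does not state: MD5 stands in for the partitioner's hash function, the node names and evenly spaced tokens are made up, and the remaining replicas are placed by walking clockwise around the ring, which is one simple replication strategy among several. The `reads_see_latest_write` helper encodes the commonly used rule that a read overlaps the latest write when the number of replicas read plus the number written exceeds RF.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 is an assumption here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Hypothetical token ring: each node owns the tokens up to its own token."""

    def __init__(self, node_tokens: dict[str, int]):
        self._ring = sorted((tok, node) for node, tok in node_tokens.items())
        self._tokens = [tok for tok, _ in self._ring]

    def replicas(self, partition_key: str, rf: int) -> list[str]:
        size = len(self._ring)
        # First replica: the first node whose token is >= the key's token,
        # wrapping around to the start of the ring if necessary.
        start = bisect.bisect_left(self._tokens, token(partition_key)) % size
        # Remaining replicas: the next distinct nodes clockwise (assumed strategy).
        return [self._ring[(start + i) % size][1] for i in range(min(rf, size))]


def reads_see_latest_write(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > rf


# Four hypothetical nodes with evenly spaced tokens over the 64-bit range.
ring = Ring({"node-a": 0, "node-b": 2**62, "node-c": 2**63, "node-d": 3 * 2**62})
owners = ring.replicas("user:42", rf=3)

quorum_overlaps = reads_see_latest_write(2, 2, rf=3)  # QUORUM reads and writes
one_overlaps = reads_see_latest_write(1, 1, rf=3)     # ONE reads and writes
```

With RF=3, reading and writing at a quorum of two replicas guarantees overlap, while reading and writing a single replica does not, which is the stale-read risk that lower consistency levels accept in exchange for availability and latency.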
Question (5 Marks)
Based on the provided text:
a) Describe the initial steps a replica node takes upon receiving a write request forwarded by the coordinator, mentioning the two key structures involved locally on that node. (2 marks)
b) What is the purpose of the Commit Log in the write path? (1 mark)
c) What happens to the Memtable periodically or when it becomes full, and what is the resulting structure called? Is this resulting structure mutable or immutable? (2 marks)
Answer:
a) Upon receiving a write request from the coordinator, a replica node first writes the data sequentially to its on-disk Commit Log. Then, it writes the data to an in-memory structure called a Memtable. The write is acknowledged back to the coordinator after these two steps are completed. (2 marks)
b) The purpose of the Commit Log is to ensure durability. By writing to disk first, Cassandra guarantees that even if the node crashes before the in-memory Memtable is flushed to disk, the write operation can be recovered from the Commit Log upon restart, preventing data loss. (1 mark)
c) Periodically, or when a Memtable becomes full, its contents are flushed to disk. The data is sorted and written as a new structure called an SSTable (Sorted String Table). According to the text, SSTables are immutable, meaning they cannot be changed once written. (2 marks)
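The write path in (a) to (c) can be sketched as a toy Python model. It is not Cassandra's storage engine: the `ReplicaStore` class, the JSON file formats, the file names, and the `memtable_limit` threshold are all hypothetical, and the sketch assumes that the commit log can be cleared once the Memtable it covers has been flushed.

```python
import json
import os
import tempfile


class ReplicaStore:
    """Hypothetical replica: commit log on disk, Memtable in memory."""

    def __init__(self, data_dir, memtable_limit=2):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # paths of flushed files, never rewritten
        self.commit_log_path = os.path.join(data_dir, "commitlog.jsonl")

    def write(self, key, value):
        # Step 1: append the write sequentially to the on-disk commit log.
        with open(self.commit_log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # Step 2: apply the write to the in-memory Memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()
        return "ack"  # acknowledged only after both steps

    def flush(self):
        # Write the Memtable, sorted by key, to a new file that is never modified.
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(sorted(self.memtable.items()), out)
        self.sstables.append(path)
        self.memtable = {}
        # Assumption: flushed writes no longer need the commit log.
        open(self.commit_log_path, "w", encoding="utf-8").close()

    def replay_commit_log(self):
        # After a crash, rebuild the Memtable from writes that were not flushed.
        if not os.path.exists(self.commit_log_path):
            return
        with open(self.commit_log_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                self.memtable[entry["key"]] = entry["value"]


tmp_dir = tempfile.mkdtemp()
store = ReplicaStore(tmp_dir, memtable_limit=2)
store.write("b", "2")
store.write("a", "1")  # Memtable is full, so it is flushed to an SSTable
store.write("c", "3")  # stays in the Memtable and the commit log

with open(store.sstables[0], encoding="utf-8") as f:
    flushed_rows = json.load(f)

# Simulated crash: a fresh process has an empty Memtable until replay.
restarted = ReplicaStore(tmp_dir, memtable_limit=2)
restarted.replay_commit_log()
```

After the simulated restart, the unflushed write for key `c` is recovered from the commit log, while keys `a` and `b` are already safe in the sorted, flushed file.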