Space shuttles, telecommunication networks, and even new hip tech startups are home to big and complex systems. These systems employ thousands of software developers on different teams. Although big systems don’t have an exact scientific definition, we will refer to systems that contain more than one service working together as a “big system.” Additional clarity is provided through plentiful Wikipedia links, but the overall structure is inspired by Martin Kelppmann‘s “Designing data-intensive applications.”
These systems can be fault-tolerant, meaning one part deviating from the expected behavior won’t necessarily cripple the rest of the system. They could also be self-healing, i.e., the system’s services can auto-recover without manual intervention for some unexpected scenarios. They could also be idempotent, which means that running the input through the system many times doesn’t change the system’s state any more than running the same input the first time. These systems can be stateless, by which we mean any instance of the service can handle any input, or contrastingly stateful, meaning that some instances are dedicated to handling specific inputs. Statelessness is recommended since it saves our components from the headache of coordination, something we discuss later.
These systems’ inputs satisfy some pre-conditions. Their output must meet a set of post-conditions if the execution terminates successfully. This is called design by contract and is the basis for all correctness tests for a system.
These systems satisfy functional requirements, meaning the system should do exactly what it’s supposed to do. Take Airbnb, for instance; listing all the beach houses in Seattle and booking the correct beach house when clicking on the app’s button are two functional requirements.
These big systems also satisfy non-functional requirements. Non-functional requirements rely on the application’s business needs; they include (but are not limited to) availability, scalability, reliability, consistency, and performance (be it response time, throughput, or another metric). For example, booking that beach house should be within acceptable time (maybe less than 5 seconds), and the website should be up all day long.
- A node is a single computer (or a device) on the network.
- The state of the system refers to the metadata of which the system keeps track.
- A read request is a piece of information retrieved from a database without altering the system’s state.
- A write request incurs a change in the state of the system.
The hype of microservices
In this article, we discuss “Shared-nothing” architecture, ones that rely on horizontal scaling. These systems scale by adding more nodes together rather than beefing up the same node with extra resources like CPU, memory, and hard drives. This is the predominant pattern used in the industry and is used in cloud computing.
So far, we have referred to the different parts of the system as components. The term used in the industry is microservices (a rebranding of the earlier service-oriented architecture). Much like classes in object-oriented programming, microservices should have a single responsibility. We shouldn’t cluster unrelated functionality into one microservice just because it’s convenient.
Application architectures should separate data nodes from control nodes, i.e., data in DBs, and control logic in “compute” nodes. Data does outlive code, so mixing concerns could be a nightmare down the line if we need to modify the system. Usually, the code should be ready to handle both backward and forward compatibility, which is somewhat tricky.
These different microservices interact together using network calls or queues to satisfy a functional requirement. They should be modular and maintainable. These microservices should be scalable, i.e., able to serve 10x the traffic with minimal changes in the system and barely noticeable performance difference when scaling up the system.
Designing a microservice is much like designing an application programming interface (API). Microservices should hide a great deal of implementation detail behind a stable and relatively straightforward abstraction. They should avoid accidental complexity at all costs.
Now, these microservices use queues, databases, caches, and coordination services to do their job. Much like importing a library when coding a program, using a database or a queue makes designing services easier since you rely on the guarantees provided by a database or a queue.
We use the term “database” as an all-encompassing term. Almost every data node could be considered a database using such a definition. Databases come in different types and flavors, from relational to document and from graph databases to distributed caches. Some argue that even queues are a particular kind of database.
Usually, the database choice is made according to the entities involved in the applications, their relationships, and the expected join traffic. A graph database, for example, can be represented using a relational model with a considerable amount of joins between rows, deteriorating both the developer’s productivity and the database performance.
The module of the database that maps keys to the hard disk is called the storage engine. It’s usually implemented using a log-structured-merge-tree (LSM) or B-tree. Regardless of how indexes’ implemented, databases use write-ahead logging (WAL) to write transactions to disk before applying them to the database. It’s used to restore the system in case of a crash.
Some databases grant ACID (atomic, consistency, isolation, and durability) guarantees for transactions. Atomic means “all or nothing.” Consistency highly depends on the application’s definition of it. Isolation means that a transaction executes in isolation of every other transaction. Durability means that changes are persisted on disk (or replicated to different nodes) before acknowledging success to the application.
Database anomalies result from application use and the isolation level (the database’s guarantees). For some applications, these anomalies would never happen. Examples of anomalies include:
- Dirty reads: A transaction reads a record that has been modified by another transaction and not yet committed.
- Dirty writes: A transaction writes a record that has been modified by another transaction and not yet committed.
- Nonrepeatable reads: A transaction reads the same record twice and gets different results throughout the same transaction.
- Phantom reads: During a transaction, records are added/removed.
- Lost update: A race condition where the outcome results because of the sequence of uncontrolled events. For example, a couple of transactions may execute concurrently so that one write overrides the other.
- Write skew: An update transaction based upon stale data.
To deal with some of the anomalies mentioned above, engineers in academia and the industry came up with isolation levels. Although these isolation levels are standard, it’s advisable to read the documentation of the specific databases multiple times before attempting to use them:
- Read committed: Allows for no dirty reads and no dirty writes.
- Snapshot isolation: Writes don’t block reads, and reads don’t block writes. Each transaction is based on a consistent snapshot of the database.
- Serializability: Mimics serial execution even if multithreading is used. This isolation mode handles all database anomalies. It’s implemented using:
- Serial execution
- Pessimistic locking: Two-Phase-Locking
- Optimistic locking
Scaling includes replication and partitioning. We dedicate a subsection for each.
The first rule in distributed systems is that nodes fail. They fail abruptly and unexpectedly. To prevent data loss when nodes fail, we use replication since it provides redundancy. If the data were to live on only one node, then we would need manual intervention to recover the data and unblock the users trying to access their account information from the failing node. Manual intervention goes against highly desirable self-healing. Manual intervention takes hours, self-healing only seconds.
One way to prevent a single point of failure (SPOF) is to copy all the data from one node to other nodes. Replication also helps with the read traffic performance, as a bonus, since reads can be directed to more nodes, reducing the strain on each node. Although replication is beneficial, its difficulty arises from keeping data in sync with write requests. We discuss three models of replication briefly.
Single leader replication
Read requests can be directed to any node in the system (called read replicas). On the other hand, write requests must be directed to a leader (or master) elected at an earlier point in time. The leader then passes the change stream (replication log) to other nodes. Replication logs can be implemented using different methods like Write Ahead Log or Statement-Based Replication.
Some systems replicate data on different nodes using active anti-entropy or read repair. Active anti-entropy works by comparing Merkle trees, a tree of hashes, between nodes to figure out what subtrees need to be copied from one node to another. Read repair only relies on the read traffic to replicate the data on different nodes. If a record is never read, then data will remain stale indefinitely. Riak gives both options to the user.
Replication lag gives rise to BASE (Basically Available, Soft State, Eventual Consistency) as contrasted to the ACID model we discussed earlier. Basically Available means read and write traffic is as available as possible. Soft State means that the state could be a little different from one read replica to another. Eventual Consistency means that eventually (after no additional write traffic), the state will be consistent in all read replicas.
A non-leader is not critical, and its failure does not affect the state of the system. Leader failover, however, is not straightforward and has to be carefully implemented. On detecting the leader’s failure, the read replicas should elect a new leader to take over. This process of electing a new leader is called achieving consensus, and we will discuss it later in this article.
Multiple leader replication
Network partitions do happen, and they happen extensively. A network partition is when some nodes in the system are disconnected from other nodes in the network. When it happens with a single leader model, writes are blocked.
This gives rise to multi-leader architectures, where a few leaders can handle write traffic. These systems are more complex and require coordination between the leaders. The downside is that writes do require consensus from other nodes.
With leaderless replication, no node is special. We must write to many nodes (w; write quorum) and read from many nodes (r; read quorum) where w + r > n (number of nodes) to get consistent data.
If we have a big database with millions of records, it’s impossible to store all the data on one node. We must split the data one way or another on multiple nodes following the “horizontal scaling” approach to which we’re adhering. Each node can have one (or more) partitions.
There are two popular schemes for partitioning:
- Partitioning by key
- Simple and easy to implement
- Produces “hot spots” where most of the traffic is only directed to some nodes
- Range queries are straightforward
- Partitioning by the hash of the key
- The hash function must be consistent between versions
- Reduces “hot spots” by being able to scatter the nodes to many nodes more evenly
- Range queries are passed to all partitions like in MongoDB, for other databases, like Couch and Riak, they don’t support range queries
Replication and partitioning together
Replication and partitioning go hand in hand in a big system. Nodes will have different replicas of partitions, as shown in the image below.
Coordination in big systems
Coordination is a not a requirement of the system but a design choice. Some systems can outsource the coordination problems to a cache, a database, a distributed lock manager, or Zookeeper.
Coordination problems include:
- Electing a leader server (consensus)
- Managing group membership
- Crash detection: Keeping tracking of available servers
- Service discovery
- Allocating work to nodes
- Managing metadata like partitions and failovers
- Distributed lock management
Although there are algorithms to handle consensus like Paxos and Raft, they’re not straightforward. Implementation can easily incur bugs. Outsourcing to a third-party system (like Zookeeper) makes sense only when keeping in mind the operational overhead incurred when the third-party system is unavailable.
Problems with networks
At the beginning of the article, we mentioned that we rely on commodity hardware and the network to build the so-called big systems. This choice doesn’t come without its tradeoffs. The network is unreliable, and it causes problems that include:
- Message delays and network partitioning: There’s no good way to differentiate between a network delay, a network partition, an overloaded node, or a node crash. The best we can use is a timeout for a heartbeat to assume a node is down.
- Clock drift: Even with using network time protocol (NTP), a node can never know the correct time for sure. Clock drifts cause all kinds of problems from a leader not renewing its leadership or for a node having an expired lock making it think it’s still valid.
In a single-leader architecture, a leader thinking that it’s the leader while it’s not anymore (due to a network partition or a clock drift) can lead to a split-brain, which can cause data corruption. A fencing token could be used for such cases.
Outdated CAP theorem
Although outdated, the CAP theorem states that one can only choose from one of two things, “consistency” or “availability,” since network partitioning happens all the time. Consistency is generally a vague term, so we replace it with the word “linearizable,” a very strong consistency model.
Linearizable means that if operation B starts after operation A has been completed, operation B must see the same state’s system as completing A or a newer state. It’s different from serializable since it doesn’t assume concurrent transactions.
Systems can be CP or AP; they could be consistent (linearizable) or available, but not both. Some systems like social media sacrifice linearizability to have high availability, while other applications like banking sacrifice availability for linearizability.
In the article, we discussed big systems with “Shared-nothing” architecture. We described desired properties we’d like to see in a distributed system like availability, reliability, and scalability. We also described different types of databases and their isolation levels. We then discussed some of the considerations for scaling the system, like partitioning, replication, and consensus.
Building a big complex system is an exciting endeavor; now you're well on the way to understanding, and even building, your own!