Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely the bottleneck for these applications; the bigger challenges are usually the amount of data, the complexity of data, and the speed at which it is changing.
This article walks you through three important pillars of designing a robust software system: Reliability, Scalability and Maintainability, with questions and bullet points to refresh your memory of best practices in a step-by-step manner.
Components of a data-intensive application
A data-intensive application is typically built from standard building blocks of functionality:
- Database — store data so that the application, or another application, can find it again later.
- Cache — remember an expensive operation, to speed up reads.
- Search indexes — allow users to search data by keyword or filter it in various ways.
- Stream processing — send a message to another process, to be handled asynchronously.
- Batch processing — periodically crunch a large amount of accumulated data.
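The cache building block above can be illustrated with a minimal read-through cache sketch. This is a toy, not a real caching library; the names (`Cache`, `fetch_user`) are made up for illustration, and the "expensive operation" is a stand-in for a database query:

```python
class Cache:
    """Remembers the result of an expensive operation to speed up reads."""

    def __init__(self, load_fn):
        self._store = {}      # in-memory store of remembered results
        self._load = load_fn  # fallback: the expensive read (e.g. a DB query)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1            # cheap path: already computed
            return self._store[key]
        self.misses += 1
        value = self._load(key)       # expensive path: go to the source
        self._store[key] = value      # remember it for next time
        return value

# Usage: pretend fetch_user is an expensive database read.
db_reads = []

def fetch_user(user_id):
    db_reads.append(user_id)          # record each "expensive" read
    return {"id": user_id, "name": f"user-{user_id}"}

cache = Cache(fetch_user)
first = cache.get(1)    # miss: hits the "database"
second = cache.get(1)   # hit: served from memory, no second DB read
```

The second `get(1)` never touches the database, which is exactly why caches speed up read-heavy workloads.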
Software inevitably faces hardware and software faults: users make mistakes or use the software in unexpected ways, hard disks crash, RAM becomes faulty. The things that can go wrong are called “faults”, and systems that anticipate faults and can cope with them are called “fault-tolerant” or “resilient”. A system that continues to work correctly even when faults occur is called “reliable”. Reliability means making systems work correctly, even when faults occur.
(Reliability) How do you ensure that the data remains correct and complete, even when things go wrong internally?
- To decouple the places where people make the most mistakes from the places where those mistakes can cause failures. In particular, set up a non-production sandbox environment where people can explore and experiment safely, using real data, without affecting real users.
- To test a fault-tolerant system, you can increase the rate of faults by triggering them deliberately, for example by randomly killing internal processes, as Netflix's Chaos Monkey does.
- To cope with hardware faults, add redundancy to the individual hardware components so that the failure of one component does not take the system down.
- To set up detailed and clear monitoring, such as performance metrics and error rates. Monitoring can surface early warning signals and lets us check whether any assumptions or constraints are being violated.
- To allow quick and easy recovery, such as making it fast to roll back configuration changes.
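The "trigger faults deliberately" idea above can be sketched as a tiny fault-injection wrapper paired with a fault-tolerant caller. This is an illustrative toy with made-up names and seeded randomness, not Netflix's actual Chaos Monkey:

```python
import random

def flaky(fn, fault_rate, rng):
    """Wrap fn so that it randomly fails: deliberate fault injection."""
    def wrapped(*args):
        if rng.random() < fault_rate:
            raise RuntimeError("injected fault")  # the simulated crash
        return fn(*args)
    return wrapped

def call_with_retry(fn, attempts=5):
    """A simple fault-tolerance strategy: retry until success or give up."""
    for _ in range(attempts):
        try:
            return fn()
        except RuntimeError:
            continue  # fault occurred; try again
    raise RuntimeError("all attempts failed")

# Usage: inject faults into a handler 30% of the time, with a fixed seed
# so the experiment is repeatable.
rng = random.Random(42)
handler = flaky(lambda: "ok", fault_rate=0.3, rng=rng)
result = call_with_retry(handler)
```

Running the wrapped handler under fault injection lets you verify that the retry logic (the fault-tolerance mechanism) actually masks the faults, rather than hoping it works when a real fault arrives.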
The big ideas in this section are load parameters (the main factors that drive the architecture design) and performance metrics (latency, response time, throughput) used to measure the performance and bottlenecks of the system.
Scalability means having strategies for keeping performance good even when load parameters increase, for example by measuring load and performance with latency percentiles and throughput.
Load parameters are the numbers that describe load, and the right ones depend on the architecture of the system: requests per second for a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.
Twitter Example — a user publishes a new message (tweet) to their followers. In this example, we consider several load parameters, including the database read/write ratio, read requests per second, and write requests per second.
- Approach 1: Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time).
- Approach 1 makes reads expensive (many lookups must be merged at read time), while posting a tweet is only a single write.
- Approach 2: Maintain a cache for each user’s home timeline. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches.
- Approach 2: Reads are cheap because the result has been computed ahead of time.
- Approach 2: The downside is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches.
Twitter’s solution — a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are exempted from this fan-out; their tweets are fetched separately and merged in at read time.
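The two approaches can be contrasted in a small sketch. In-memory dicts stand in for the tweet store and the per-user timeline caches; all names and data are illustrative, not Twitter's actual design:

```python
from collections import defaultdict

follows = {"alice": ["bob", "carol"], "bob": ["carol"]}  # follower -> followees
followers = defaultdict(list)                            # followee -> followers
for user, followees in follows.items():
    for followee in followees:
        followers[followee].append(user)

tweets = defaultdict(list)     # approach 1: global store, keyed by author
timelines = defaultdict(list)  # approach 2: precomputed home-timeline caches

def post_tweet(author, text):
    tweets[author].append(text)            # approach 1: one cheap write
    for follower in followers[author]:     # approach 2: fan out on write,
        timelines[follower].append((author, text))  # one write per follower

def home_timeline_on_read(user):
    """Approach 1: merge followees' tweets at read time (expensive read)."""
    return [(a, t) for a in follows.get(user, []) for t in tweets[a]]

post_tweet("carol", "hello")
```

After one post by carol, approach 1 must scan carol's tweets on every read of alice's timeline, while approach 2 has already pushed the tweet into both alice's and bob's caches; this is exactly the read-cost vs. write-cost trade-off discussed above.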
After getting an idea of the load parameters, we can discuss performance under that load. To measure performance, we usually care about:
- throughput — the number of records we can process per second, e.g. in a batch processing system such as Hadoop.
- service response time — what matters in online systems, including network delays and queueing time.
Latency and response time are not the same. Latency is the duration that a request is waiting to be handled (queueing delay). Response time is what the client sees: it includes the actual time to process the request plus network delays and queueing delays.
Percentiles are a better way to measure response time than averages: they tell you what fraction of users experience a given speed. For example, the 95th percentile (p95) is the threshold that 95% of requests are faster than.
Maintainability can be considered in three areas: operability, simplicity and evolvability.
- Operability — make it easy for operations teams to keep the system running smoothly.
- Simplicity — make it easy for new engineers to understand the system.
- Evolvability — make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.
(Maintainability) How do you provide consistently good performance to clients, even when parts of your system are degraded?
- Monitoring the health of the system and quickly restoring service if it goes into a bad state.
- Tracking down the cause of problems, such as system failures or degraded performance.
- Keeping software and platforms up to date, including security patches.
- Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage (downstream and upstream services).
- Anticipating future problems and solving them before they occur (e.g., capacity planning).
- Establishing good practices and tools for deployment, configuration management, and more.
- Performing complex maintenance tasks, such as moving an application from one platform to another.
- Maintaining the security of the system as configuration changes are made.
- Defining processes that make operations predictable and help keep the production environment stable.
- Preserving the organisation’s knowledge about the system, even as individual people come and go.
- Martin Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, O’Reilly Media, 2017.