Reliability, Scalability and Maintainability of Data-Intensive Applications

Many applications today are data-intensive, as opposed to compute-intensive. CPU power is rarely the bottleneck for these applications; the bigger challenges are usually the amount of data, the complexity of the data and the speed at which it is changing.

Table Of Contents

What functionality do data-intensive applications commonly need?

[Reliability] How do you ensure that the data remains correct and complete, even when things go wrong internally (a high degree of reliability, fault tolerance and resilience)?

[Scalability] How do you scale to handle an increase in load? What does a good API for the service look like?

[Maintainability] How do operability, simplicity and evolvability contribute to maintainability?

Components of a data-intensive application

  1. Database: store data so that the application, or another application, can find it again later.
  2. Cache: remember the result of an expensive operation, to speed up reads.
  3. Search indexes: allow users to search data by keyword or filter it in various ways.
  4. Stream processing: send a message to another process, to be handled asynchronously.
  5. Batch processing: periodically crunch a large amount of accumulated data.
Figure 1. Possible Architecture for a data system that combines several components


Hardware faults, software faults and human errors are inevitable: hard disks crash, RAM becomes faulty, and users make mistakes or use the software in unexpected ways. The things that can go wrong are called “faults”, and systems that anticipate faults and cope with them are called “fault-tolerant” or “resilient”. A system that continues to work correctly is called “reliable”. Reliability means making systems work correctly, even when faults occur.

(Reliability) How do you ensure that the data remains correct and complete, even when things go wrong internally?

  • Decouple the places where people make the most mistakes from the places where those mistakes can cause failures. In particular, set up a non-production sandbox environment where people can explore and experiment safely, using real data, without affecting real users.
  • Test fault-tolerant systems by deliberately increasing the rate of faults, for example by randomly killing individual processes without warning, as Netflix's Chaos Monkey does.
  • Tolerate hardware faults by adding redundancy to individual hardware components, so that a single failure does not take down the whole system.
  • Set up detailed and clear monitoring, such as performance metrics and error rates. Monitoring can show early warning signals and lets us check whether assumptions or constraints are being violated.
  • Allow quick and easy recovery, for example by making it fast to roll back configuration changes.
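
The idea of triggering faults deliberately can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Chaos Monkey works internally: `flaky`, `call_with_retries`, `lookup_user`, the 20% failure rate and the retry limit are all hypothetical names and values chosen for the example.

```python
import random


class InjectedFault(Exception):
    """Raised when the test harness deliberately fails a call."""


def flaky(func, failure_rate, rng):
    """Wrap func so that a random fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("deliberately injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts, *args, **kwargs):
    """Call func, retrying on injected faults, up to `attempts` tries in total."""
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except InjectedFault as error:
            last_error = error
    raise last_error


def lookup_user(user_id):
    # Hypothetical stand-in for a real backend call.
    return {"id": user_id, "name": f"user-{user_id}"}


rng = random.Random(42)  # fixed seed so that a test run is reproducible
unreliable_lookup = flaky(lookup_user, failure_rate=0.2, rng=rng)

# A fault-tolerant caller should still serve every request.
results = [call_with_retries(unreliable_lookup, 20, i) for i in range(100)]
print(len(results), "requests served despite injected faults")
```

The point of the exercise is the last two lines: if the retry logic were missing or broken, the injected faults would surface in a test run rather than in production.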


Scalability means having strategies for keeping performance good even when load increases. To reason about it, we need ways to describe load and performance quantitatively: load parameters on one side, and latency percentiles and throughput on the other.

Load parameters are numbers that describe the load on a system. Which ones matter depends on the architecture of the system: requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else.
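
As a rough illustration, a few of these load parameters can be derived from a request log. The `Request` record, its fields and the sample log below are hypothetical; a real system would read these numbers from its metrics pipeline.

```python
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the observation window
    kind: str         # "read" or "write"
    cache_hit: bool   # whether a read was served from the cache


def load_parameters(requests, window_seconds):
    """Summarize a request log into a few common load parameters."""
    reads = sum(1 for r in requests if r.kind == "read")
    writes = sum(1 for r in requests if r.kind == "write")
    hits = sum(1 for r in requests if r.kind == "read" and r.cache_hit)
    return {
        "requests_per_second": len(requests) / window_seconds,
        "read_write_ratio": reads / writes if writes else float("inf"),
        "cache_hit_rate": hits / reads if reads else 0.0,
    }


# Hypothetical two-second sample: four reads (three cache hits), two writes.
log = [
    Request(0.1, "read", True),
    Request(0.4, "read", False),
    Request(0.9, "write", False),
    Request(1.2, "read", True),
    Request(1.8, "read", True),
    Request(1.9, "write", False),
]
params = load_parameters(log, window_seconds=2.0)
print(params)
```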

Twitter example: a user publishes a new message to their followers. Here the relevant load parameters include the ratio of database reads to writes, read requests per second and write requests per second.

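  • Approach 1: Posting a tweet is a single write, an insert into a global collection of tweets. Reads are expensive: each home timeline request looks up all the people the user follows, finds all their tweets and merges them, sorted by time.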
Figure 2.0 Query of database
Figure 2.1 (Approach 1) Simple relational schemas for implementing a Twitter home timeline
  • Approach 2: Maintain a cache for each user’s home timeline. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches.
  • Approach 2: Reading the home timeline is cheap because its result has been computed ahead of time.
  • Approach 2: The downside is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets per second become 345k writes per second to the home timeline caches.
Figure 2.2 Twitter’s data pipeline for delivering tweets to followers, with load parameters
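
The trade-off between the two approaches can be sketched with in-memory data structures. This is a simplified illustration, not Twitter's implementation: the dictionaries, function names and sample users are hypothetical, and `post_tweet` writes to both the global list and the caches only so that the two read paths can be compared side by side.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the follows and tweets tables.
followers = defaultdict(set)        # author -> users who follow them
following = defaultdict(set)        # user -> authors they follow
tweets = []                         # global list of (sequence, author, text)
timeline_cache = defaultdict(list)  # user -> precomputed home timeline


def follow(user, author):
    following[user].add(author)
    followers[author].add(user)


def post_tweet(author, text):
    tweet = (len(tweets), author, text)
    tweets.append(tweet)                    # approach 1: the single write
    for user in followers[author]:          # approach 2: one extra write per follower
        timeline_cache[user].append(tweet)
    return tweet


def timeline_on_read(user):
    """Approach 1: scan and merge the followed authors' tweets at read time."""
    wanted = following[user]
    return sorted((t for t in tweets if t[1] in wanted), reverse=True)


def timeline_from_cache(user):
    """Approach 2: return the precomputed timeline, newest first."""
    return sorted(timeline_cache[user], reverse=True)


follow("alice", "bob")
follow("alice", "carol")
follow("dave", "bob")
post_tweet("bob", "hello")
post_tweet("carol", "hi")
post_tweet("erin", "nobody follows me yet")

alice_read = timeline_on_read("alice")
alice_cached = timeline_from_cache("alice")
print(alice_read == alice_cached)

# The write amplification quoted above: tweets per second times average followers.
tweets_per_second = 4_600
average_followers = 75
cache_writes_per_second = tweets_per_second * average_followers
print(cache_writes_per_second)  # 345000
```

Both read paths return the same timeline; what differs is where the work happens. Approach 1 pays on every read, while approach 2 pays once per follower on every write.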

Twitter solution — a hybrid of both approaches. Most users’ tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers (i.e., celebrities) are excepted from this fan-out.
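
A minimal sketch of the hybrid, assuming a simple follower-count cutoff decides who counts as a celebrity. The threshold, names and data structures are hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff; a real one would be far larger

followers = defaultdict(set)          # author -> users who follow them
following = defaultdict(set)          # user -> authors they follow
timeline_cache = defaultdict(list)    # user -> tweets fanned out at write time
celebrity_tweets = defaultdict(list)  # author -> tweets merged in at read time
next_id = 0


def follow(user, author):
    following[user].add(author)
    followers[author].add(user)


def post_tweet(author, text):
    global next_id
    tweet = (next_id, author, text)
    next_id += 1
    if len(followers[author]) >= CELEBRITY_THRESHOLD:
        celebrity_tweets[author].append(tweet)      # excepted from fan-out
    else:
        for user in followers[author]:
            timeline_cache[user].append(tweet)      # fan-out on write
    return tweet


def home_timeline(user):
    """Merge the cached timeline with tweets from followed celebrities."""
    merged = list(timeline_cache[user])
    for author in following[user]:
        merged.extend(celebrity_tweets.get(author, []))
    return sorted(merged, reverse=True)


for fan in ("a", "b", "c"):
    follow(fan, "star")
follow("a", "friend")
post_tweet("friend", "lunch")     # fanned out to the one follower
post_tweet("star", "tour dates")  # stored once, merged at read time

timeline_a = home_timeline("a")
timeline_b = home_timeline("b")
print(timeline_a)
```

Each tweet is stored in exactly one place, so the merge at read time never produces duplicates.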

Once the load on the system is described, we can look at how it performs under that load. To measure performance, we usually care about:

  • Service response time: in online systems, the time between a client sending a request and receiving a response, including network delays and queueing time.

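Latency and response time are often used interchangeably, but they are not the same. Latency is the duration that a request spends waiting to be handled (queueing delay). Response time is what the client sees: besides the time to process the request itself, it includes network delays and queueing delays.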

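Percentiles are a better measure than the mean because they show what proportion of requests are served within a given time. For example, if the 95th percentile response time is 1.5 seconds, then 95 out of 100 requests take less than 1.5 seconds and 5 take 1.5 seconds or more.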
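
To see how the mean and the percentiles can tell different stories, here is a small sketch using the nearest-rank method on made-up response times. The sample values and the `percentile` helper are assumptions for illustration; monitoring tools often use approximation algorithms instead of sorting every sample.

```python
import math
import statistics


def percentile(sorted_times, p):
    """Nearest-rank percentile of an already sorted list, for integer p in 1..100."""
    rank = math.ceil(p * len(sorted_times) / 100)
    return sorted_times[max(rank, 1) - 1]


# Hypothetical response times in milliseconds: mostly fast, two slow outliers.
response_times_ms = sorted([12, 15, 14, 13, 16, 18, 17, 20, 250, 1200])

mean = statistics.mean(response_times_ms)
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)

print("mean:", mean)   # 157.5
print("p50:", p50)     # 16
print("p95:", p95)     # 1200
print("p99:", p99)     # 1200
```

The mean of 157.5 ms describes no request that actually happened: half of the requests finished in 16 ms or less, while the slowest took more than a second.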

Figure 3.0 Illustrating mean and percentiles: response times for a sample of 100 requests to a service
Figure 3.1 When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
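
The effect in Figure 3.1 can be estimated with a short calculation. The sketch below assumes the backend calls are independent and that each has the same chance of being slow, which real systems only approximate; the function name and the 1% figure are hypothetical.

```python
def chance_of_slow_request(backend_calls, slow_fraction):
    """Probability that at least one of several independent backend calls is slow."""
    return 1 - (1 - slow_fraction) ** backend_calls


# Suppose 1% of backend calls are slow.
for calls in (1, 10, 100):
    chance = chance_of_slow_request(calls, 0.01)
    print(f"{calls:>3} backend calls: {chance:.1%} of end-user requests are slow")
```

With 100 backend calls per request, roughly 63% of end-user requests hit at least one slow call, even though only 1% of individual calls are slow.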


Maintainability can be considered in three areas: operability, simplicity and evolvability.

  • Operability — make it easy for operations teams to keep the system running smoothly.
  • Simplicity — make it easy for new engineers to understand the system.
  • Evolvability — make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change.

(Maintainability) How do you provide consistently good performance to clients, even when parts of your system are degraded? A good operations team is typically responsible for:

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state.
  • Tracking down the cause of problems, such as system failures or degraded performance.
  • Keeping software and platforms up to date, including security patches.
  • Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage (downstream and upstream services).
  • Anticipating future problems and solving them before they occur (e.g., capacity planning).
  • Establishing good practices and tools for deployment, configuration management, and more.
  • Performing complex maintenance tasks, such as moving an application from one platform to another.
  • Maintaining the security of the system as configuration changes are made.
  • Defining processes that make operations predictable and help keep the production environment stable.
  • Preserving the organisation’s knowledge about the system, even as individual people come and go.

