Performance Realities: Throughput vs Latency vs Concurrency — What Actually Matters
Introduction:
Throughput, latency, and concurrency are three of the most frequently used terms in performance engineering. They appear in architecture discussions, capacity planning documents, and incident postmortems. They are also routinely used interchangeably, or conflated in ways that lead to poor optimisation decisions.
They are not the same thing. They measure different aspects of system behaviour, they respond to different interventions, and optimising for one can actively degrade the others. Understanding the distinction — and knowing which one actually matters for your specific system — is the foundation of effective performance engineering.
The wrong metric leads to the wrong optimisation. The wrong optimisation leads to a system that performs well on benchmarks and poorly in production.
Latency Is What Users Experience:
Latency is the time it takes for a single request to complete — from the moment a client sends a request to the moment it receives a response. It is the metric that users feel directly. A page that takes three seconds to load feels slow regardless of how many other users the system is simultaneously serving.
Latency is deceptively simple to define but difficult to measure meaningfully. Average latency hides the experience of users at the tail of the distribution. A system with an average latency of 50 milliseconds might have a 99th percentile latency of two seconds, meaning roughly one in every hundred requests takes two seconds or longer even though the average looks fine.
Percentile latency, particularly p95, p99, and p99.9, is a far more reliable guide to what users are actually experiencing than the average. Optimising for average latency while ignoring tail latency is one of the most common and most consequential performance mistakes.
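To make the gap between the average and the tail concrete, here is a minimal sketch in Python. The latency values are invented for illustration, and the nearest-rank method used here is only one of several common percentile definitions; monitoring tools may interpolate differently.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # round() guards against floating-point noise pushing ceil() up by one rank.
    rank = math.ceil(round(p * len(ordered) / 100, 9))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [30] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p95={p95_ms} ms  p99={p99_ms} ms")
```

In this made-up sample the mean stays under 70 ms while the p99 is two seconds, which is the pattern described above: the average looks healthy and the tail does not.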
Throughput Is What the System Can Handle:
Throughput is the number of requests a system can process per unit of time — requests per second, transactions per minute, messages per hour. It is a measure of capacity rather than speed. A system with high throughput can handle large volumes of work. A system with low throughput becomes a bottleneck as demand increases.
Throughput and latency are related but not equivalent. A system can have high throughput and high latency: it processes many requests per second but takes a long time to complete each one. A system can also have low latency and low throughput: it responds quickly to individual requests but cannot sustain a high rate of them.
Confusing throughput with latency leads to situations where teams add capacity to improve user experience but see no improvement, because the bottleneck was never capacity — it was the time taken to process each individual request.
Concurrency Is the Bridge Between the Two:
Concurrency is the number of requests a system is processing simultaneously at any given moment. It is not a performance goal in itself — it is a characteristic of how a system handles work, and it connects throughput and latency through a relationship known as Little's Law.
Little's Law states that the average number of requests in a system equals the average arrival rate multiplied by the average time each request spends in the system. In practical terms, this means that if your latency increases — requests take longer to process — your concurrency increases proportionally even if your throughput stays the same.
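Written out with the conventional symbols (L for average requests in the system, λ for average arrival rate, W for average time in the system; these symbols and the numbers below are illustrative and not taken from a real system), the relationship and a small worked example look like this:

```latex
L = \lambda W

\lambda = 200\ \text{req/s},\quad W = 0.05\ \text{s}
\;\Rightarrow\; L = 200 \times 0.05 = 10\ \text{requests in flight}

\lambda = 200\ \text{req/s},\quad W = 0.5\ \text{s}
\;\Rightarrow\; L = 200 \times 0.5 = 100\ \text{requests in flight}
```

The arrival rate is identical in both cases. A tenfold increase in latency alone produces a tenfold increase in concurrency. The law describes long-run averages of a stable system, so it is a guide to steady-state behaviour rather than to any single instant.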
This is why latency spikes often precede throughput collapse. As latency increases, more requests pile up in flight simultaneously. Eventually the system runs out of threads, connections, or memory to handle them, and throughput drops sharply. The failure mode looks like a capacity problem but the root cause is a latency problem.
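The progression from latency spike to throughput collapse can be sketched with Little's Law and a fixed worker limit. This is a simplified steady-state model, not a simulation; the arrival rate, the worker limit, and the function names are hypothetical.

```python
def in_flight(arrival_rate_per_s, latency_s):
    """Average concurrent requests implied by Little's Law."""
    return arrival_rate_per_s * latency_s

def saturation_latency(arrival_rate_per_s, max_in_flight):
    """Latency at which average in-flight requests reach the worker limit."""
    return max_in_flight / arrival_rate_per_s

def max_throughput(max_in_flight, latency_s):
    """Highest completion rate the workers can sustain at a given latency."""
    return max_in_flight / latency_s

ARRIVAL_RATE = 200  # requests per second (assumed)
WORKER_LIMIT = 50   # threads or connections available (assumed)

for latency_s in (0.05, 0.1, 0.25, 0.5):
    needed = in_flight(ARRIVAL_RATE, latency_s)
    ceiling = max_throughput(WORKER_LIMIT, latency_s)
    status = "ok" if ceiling >= ARRIVAL_RATE else "cannot keep up"
    print(f"latency={latency_s:.2f}s  needed={needed:.0f}  "
          f"max_throughput={ceiling:.0f}/s  {status}")
```

With these numbers the workers are fully occupied once latency reaches 0.25 seconds. At 0.5 seconds the same 50 workers can complete only 100 requests per second against 200 arriving, so the backlog grows even though no capacity was removed.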
They Trade Off Against Each Other in Predictable Ways:
Batching is one of the clearest examples of the throughput-latency trade-off. Processing requests in batches rather than individually dramatically increases throughput — you amortise fixed costs across many requests. But it increases latency because individual requests must wait for a batch to fill before processing begins.
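A rough model of that trade-off, assuming each batch pays one fixed cost plus a per-item cost and that items arrive evenly spaced. The cost figures, the arrival rate, and the function names are made up for illustration.

```python
def batch_throughput(batch_size, fixed_cost_s, per_item_cost_s):
    """Items processed per second when each batch pays the fixed cost once."""
    return batch_size / (fixed_cost_s + batch_size * per_item_cost_s)

def first_item_fill_wait(batch_size, arrival_rate_per_s):
    """How long the first item waits for the batch to fill (evenly spaced arrivals)."""
    return (batch_size - 1) / arrival_rate_per_s

FIXED_COST_S = 0.010     # per-batch overhead, e.g. a round trip (assumed)
PER_ITEM_COST_S = 0.001  # marginal cost per item (assumed)
ARRIVAL_RATE = 100       # items per second (assumed)

for size in (1, 10, 100):
    rate = batch_throughput(size, FIXED_COST_S, PER_ITEM_COST_S)
    wait = first_item_fill_wait(size, ARRIVAL_RATE)
    print(f"batch={size:>3}  throughput={rate:7.1f}/s  first-item wait={wait:.2f}s")
```

Under these assumptions, going from a batch size of 1 to 100 raises processing capacity roughly tenfold, while the first item in each batch waits almost a full second before any work starts.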
Message queues make the same trade-off. Decoupling producers from consumers via a queue allows the system to absorb traffic spikes and maintain high throughput — but requests sit in the queue before being processed, increasing latency.
Connection pooling accepts some extra concurrency-management complexity in exchange for improvements in both throughput and latency. By reusing connections rather than establishing a new one for each request, the system reduces per-request latency and increases the number of requests it can handle simultaneously.
Every performance optimisation involves trade-offs between these three dimensions. Understanding which dimension matters most for your system determines which optimisations are worth making.
The Right Metric Depends on What You Are Building:
For interactive user-facing applications — web pages, mobile apps, APIs that humans are waiting on — latency is the metric that matters most. Users do not care about throughput. They care about how long they are waiting.
For data processing systems — batch jobs, ETL pipelines, stream processing — throughput is typically the primary concern. These systems process large volumes of data where individual record latency is less important than overall pipeline completion time.
For systems handling unpredictable traffic spikes — payment processors, ticketing systems, flash sale platforms — concurrency management is critical. The system needs to handle simultaneous load without latency degrading catastrophically or throughput collapsing.
Most real systems care about all three to varying degrees. The art is knowing which one to prioritise when they conflict — and they always conflict eventually.
Measuring the Wrong Thing Leads to Wrong Conclusions:
A system that is optimised purely for throughput benchmarks can feel unusably slow to individual users. A system optimised purely for average latency can collapse under load because tail latency was never addressed. A system with high concurrency limits but poor latency characteristics will hit those limits faster than expected as latency increases under load.
Performance instrumentation needs to capture all three dimensions — not just the one that looks best in a benchmark. Throughput under sustained load, latency at multiple percentiles, and concurrency levels during traffic spikes together give a complete picture of how a system actually behaves.
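As a minimal sketch of capturing all three dimensions from the same data, the function below takes a list of (start, end) timestamps for completed requests and derives throughput, percentile latency, and peak concurrency. The record format and the name `summarize` are assumptions for this example; a real system would normally take these figures from its metrics library.

```python
import math

def summarize(requests):
    """requests: list of (start_s, end_s) tuples for completed requests."""
    if not requests:
        raise ValueError("requests must not be empty")
    latencies = sorted(end - start for start, end in requests)
    window = max(end for _, end in requests) - min(start for start, _ in requests)
    if window <= 0:
        raise ValueError("observation window must be positive")

    def pct(p):
        # Nearest-rank percentile over the sorted latencies.
        rank = max(1, math.ceil(round(p * len(latencies) / 100, 9)))
        return latencies[rank - 1]

    # Peak concurrency: sweep start (+1) and end (-1) events in time order.
    # Tuple sorting puts -1 before +1 at equal timestamps, so a request that
    # ends exactly when another starts is not counted as overlapping.
    events = sorted([(s, 1) for s, _ in requests] + [(e, -1) for _, e in requests])
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)

    return {
        "throughput_per_s": len(requests) / window,
        "p50_s": pct(50),
        "p99_s": pct(99),
        "peak_concurrency": peak,
    }

# Hypothetical request log: (start, end) in seconds.
sample = [(0.0, 1.0), (0.5, 1.5), (0.75, 1.0), (2.0, 4.0)]
report = summarize(sample)
print(report)
```

Reporting the three numbers side by side makes it harder to celebrate a throughput figure while the p99 or the peak concurrency quietly approaches a limit.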
Optimising based on incomplete measurements is one of the primary reasons performance work fails to produce real-world improvements.
Conclusion:
Throughput, latency, and concurrency are distinct dimensions of system performance that interact with each other in ways that are not always intuitive. Conflating them leads to optimisations that improve benchmark numbers without improving user experience, and to capacity decisions that address the wrong bottleneck.
The engineers who navigate performance problems most effectively are the ones who can precisely identify which dimension is causing the problem, understand how it interacts with the others, and choose interventions that address the actual constraint rather than the most visible symptom.