AW Dev Rethought

🌟 The best way to predict the future is to invent it - Alan Kay

Systems Realities: Learning From Near-Misses in Production Systems


Introduction:

Not every failure becomes an outage.

In many systems, small issues occur regularly — a timeout that resolves itself, a retry that succeeds, a fallback that masks an error, or a dependency that briefly degrades and recovers.

These events are often dismissed because nothing “broke.”

But near-misses are some of the most valuable signals a system can provide. They reveal weaknesses without the cost of a full incident. Ignoring them means waiting for the same conditions to align again — often with worse consequences.


Near-Misses Reveal Hidden Weaknesses:

A near-miss is not a success.

It is a failure that was contained by chance, timing, or an existing safeguard. The underlying issue still exists.

Examples include:

  • retries hiding transient failures
  • fallbacks masking degraded services
  • partial outages that recover before escalation

These situations indicate that the system is closer to failure than it appears.


Retries Can Hide System Stress:

Retries often make systems appear resilient.

Requests succeed eventually, so the issue is considered resolved. However, retries increase load, amplify latency, and may indicate deeper instability.

A system relying heavily on retries is often compensating for underlying problems rather than solving them.
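This load-amplification effect is easy to demonstrate. Below is a minimal sketch (all names hypothetical) of a retry wrapper instrumented to count attempts: callers see a high success rate, while the dependency quietly serves far more traffic than the request count suggests.

```python
import random

def call_with_retries(op, max_attempts=3):
    """Call `op`, retrying on failure.

    Returns (result, attempts_used, succeeded). Every retry is an
    extra request hitting the dependency, so a flaky backend sees
    its load multiplied even while most callers observe success.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op(), attempt, True
        except Exception:
            pass
    return None, max_attempts, False

# Simulate a dependency whose calls fail 30% of the time.
random.seed(1)
def flaky():
    if random.random() < 0.3:
        raise RuntimeError("transient failure")
    return "ok"

calls = attempts = successes = 0
for _ in range(1000):
    _, used, ok = call_with_retries(flaky)
    calls += 1
    attempts += used
    successes += ok

# Callers see ~97% success, but the dependency served ~1.4x the traffic.
print(f"success rate: {successes / calls:.1%}, "
      f"load amplification: {attempts / calls:.2f}x")
```

The attempts-per-call ratio is the near-miss signal here: a value drifting above 1.0 means the system is succeeding by compensation, not by health.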


Fallbacks Can Mask Real Problems:

Fallback mechanisms are essential for resilience.

But they can also create false confidence. When a fallback activates, users may still receive a response, but the system is operating in a degraded mode.

If fallbacks are triggered frequently, they signal that primary systems are unreliable.
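One way to keep fallbacks from masking trouble is to record every activation and compute a fallback ratio over a recent window. A minimal sketch, with all class and method names hypothetical:

```python
from collections import deque

class FallbackGuard:
    """Wrap a primary call with a fallback, but record every
    activation so frequent fallback use becomes visible instead
    of silently masking a failing primary."""

    def __init__(self, window=100, alert_ratio=0.2):
        self.recent = deque(maxlen=window)   # True = fallback was used
        self.alert_ratio = alert_ratio

    def call(self, primary, fallback):
        try:
            result = primary()
            self.recent.append(False)
        except Exception:
            result = fallback()
            self.recent.append(True)
        return result

    @property
    def fallback_ratio(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def degraded(self):
        # Users still get responses; this flag is the signal operators need.
        return self.fallback_ratio >= self.alert_ratio

guard = FallbackGuard()

def primary_down():
    raise ConnectionError("primary unavailable")

for _ in range(50):
    guard.call(primary_down, lambda: "cached value")

print(guard.fallback_ratio, guard.degraded())  # 1.0 True
```

Every call above returned a usable response, yet the guard correctly reports a fully degraded primary.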


Transient Errors Are Not Always Harmless:

Short-lived failures are easy to ignore.

A brief spike in errors or latency may not impact users significantly, but it can indicate resource contention, scaling limits, or dependency issues.

Repeated transient errors often precede larger incidents.
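Distinguishing an isolated blip from a recurring one can be as simple as bucketing error timestamps into windows and checking how many windows are affected. A rough sketch under that assumption (the function name and thresholds are illustrative):

```python
from collections import Counter

def recurring_transients(event_times_s, window_s=60, min_windows=3):
    """Group transient-error timestamps (in seconds) into fixed
    windows and report whether the same kind of blip keeps coming
    back. A single burst is noise; recurrence across windows is
    a trend worth investigating before it becomes an incident."""
    buckets = Counter(t // window_s for t in event_times_s)
    return len(buckets) >= min_windows

# One isolated burst within a single minute: not a pattern.
print(recurring_transients([5, 6, 7, 8]))          # False
# The same brief errors returning every few minutes: a pattern.
print(recurring_transients([5, 130, 260, 400]))    # True
```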


Observability Must Capture Near-Misses:

Many monitoring systems focus only on hard failures.

Near-misses require tracking signals such as:

  • retry rates
  • fallback activations
  • latency spikes
  • partial failures

Without visibility into these patterns, systems appear healthier than they actually are.
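The signals above can be tracked with a handful of counters per request. A minimal in-process sketch (names hypothetical; a real system would export these to a metrics backend such as Prometheus or StatsD, and would use a proper quantile estimator rather than the crude nearest-rank p99 below):

```python
class NearMissMetrics:
    """Minimal in-process counters for near-miss signals:
    retry rate, fallback rate, and tail latency."""

    def __init__(self):
        self.requests = 0
        self.retries = 0
        self.fallbacks = 0
        self.latencies_ms = []

    def record(self, latency_ms, retries=0, used_fallback=False):
        self.requests += 1
        self.retries += retries
        self.fallbacks += int(used_fallback)
        self.latencies_ms.append(latency_ms)

    def snapshot(self):
        lat = sorted(self.latencies_ms)
        p99 = lat[int(0.99 * (len(lat) - 1))]  # crude nearest-rank p99
        return {
            "retry_rate": self.retries / self.requests,
            "fallback_rate": self.fallbacks / self.requests,
            "p99_ms": p99,
        }

m = NearMissMetrics()
for i in range(100):
    m.record(latency_ms=20 + (500 if i % 50 == 0 else 0),
             retries=1 if i % 10 == 0 else 0,
             used_fallback=(i % 25 == 0))

# Average latency looks fine; the near-miss view does not.
print(m.snapshot())
```

Note that none of these requests "failed": every number in the snapshot comes from requests that ultimately succeeded.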


Human Response Often Normalises Risk:

Teams become accustomed to small issues.

Engineers see occasional alerts, minor delays, or intermittent errors and begin to treat them as normal. Over time, this normalisation reduces urgency.

When the same pattern eventually leads to a major incident, the warning signs were already present.


Near-Misses Are Opportunities for Cheap Learning:

Incidents are expensive.

They impact users, require coordination, and consume engineering time. Near-misses offer the same learning opportunity without those costs.

Analysing near-misses allows teams to fix issues before they escalate.


Patterns Matter More Than Individual Events:

A single near-miss may not justify action.

Repeated patterns do.

Multiple small signals — retries, delays, degraded dependencies — often indicate systemic issues. Recognising these patterns early helps prevent cascading failures.
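A simple way to act on patterns rather than single events is to escalate only when several *different* kinds of weak signal land in the same time window. A sketch under that assumption (signal names and thresholds are illustrative):

```python
from collections import defaultdict

def systemic_risk(signals, window_s=300, min_kinds=2):
    """signals: list of (timestamp_s, kind) weak events, e.g.
    kinds like "retry_spike", "fallback", "latency". Any single
    kind may be noise; several distinct kinds inside one window
    suggests a systemic issue rather than a local blip."""
    kinds_per_window = defaultdict(set)
    for t, kind in signals:
        kinds_per_window[t // window_s].add(kind)
    return any(len(kinds) >= min_kinds for kinds in kinds_per_window.values())

# The same signal repeating in isolation: no escalation.
print(systemic_risk([(10, "retry_spike"), (400, "retry_spike")]))   # False
# Retries, fallbacks, and latency all within five minutes: escalate.
print(systemic_risk([(10, "retry_spike"), (50, "fallback"),
                     (90, "latency")]))                             # True
```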


Designing Systems to Surface Near-Misses:

Systems should make near-misses visible.

This includes:

  • logging fallback usage
  • tracking retry behaviour
  • monitoring latency distributions
  • alerting on unusual patterns

Visibility turns hidden risk into actionable insight.
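For the "alerting on unusual patterns" item, even a crude statistical check beats a fixed threshold, because "unusual" is relative to the system's own baseline. A minimal sketch using mean plus k standard deviations (the function name and the choice of k are assumptions, not a recommendation):

```python
import statistics

def unusual(baseline, value, k=3.0):
    """Flag a value sitting more than k standard deviations above
    the baseline mean. Real alerting would use a rolling baseline
    and a more robust estimator; this shows the idea only."""
    mean = statistics.fmean(baseline)
    sd = statistics.pstdev(baseline)
    return value > mean + k * max(sd, 1e-9)

baseline_latency_ms = [20, 21, 19, 20, 22, 20, 21, 19]

print(unusual(baseline_latency_ms, 21))   # False: normal variation
print(unusual(baseline_latency_ms, 45))   # True: a spike worth surfacing
```

A static "alert above 500 ms" rule would have missed the 45 ms spike entirely, even though for this service it is a clear departure from normal behaviour.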


Conclusion:

Near-misses are not noise.

They are early warnings from the system — signals that something is not behaving as expected. Ignoring them delays learning until failure becomes unavoidable.

Strong systems are not defined by the absence of incidents, but by how quickly they learn from signals that precede them.

