A primer
Debugging and triaging issues in distributed systems is non-trivial – in fact, I’d argue that thinking and building for distributed systems is what makes most professional software engineering jobs difficult, more so than the coding itself. Coding up an application is easy, but not when you run into scalability, rate limiting, large data volumes, or high-IOPS near-real-time streams of 1+ million records/second.
It’s already difficult getting a single machine to agree to operate the way you want; now you’re thrust into a universe involving multiple machines and networks, each of which introduces unexpected failure domains. Let’s dive into some basic issues plaguing such systems.
Message Transmission Failures
Conducting a root cause analysis on message transmission failures between even a single client and a single server is difficult. What was the cause of the transmission failure? Was it even a transmission failure, or a processing failure? Can we even triage the sources of failure? Let’s look at four failure modes in the simplest distributed system known to people-kind ( yes, women-kind and man-kind ): two standalone machines/nodes communicating with each other in an HTTP-esque request-response pattern. A client-side sketch of why these modes are hard to tell apart follows the diagram link below.
- The client-side process starts, but fails to process the payload.
- The client-side application sends the payload in a request, but the message drops before reaching the server.
- The server-side process receives the message, but fails to process it successfully.
- The server-side process sends a payload in the response, but the message drops before reaching the client.
( https://excalidraw.com/#room=60dbd3c682e2803e291f,KU69vHg4E_ifgZp7NvNzSA )
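To make the ambiguity concrete, here’s a minimal sketch of what the client actually observes, using Python’s requests library and a hypothetical https://api.example.com/orders endpoint. Notice how a timeout alone cannot distinguish a request that never arrived (mode 2) from a server that crashed mid-processing (mode 3) or a response lost on the way back (mode 4).

```python
import requests


def send_order(payload: dict) -> str:
    """Send one request and report what the client can (and cannot) observe."""
    try:
        resp = requests.post(
            "https://api.example.com/orders",  # hypothetical endpoint
            json=payload,
            timeout=2.0,  # seconds to wait for connect + response
        )
        resp.raise_for_status()
        return "success: server received, processed, and responded"
    except requests.exceptions.ConnectionError:
        # Most likely mode 2: the request never reached a listening server.
        return "connection error: request probably never reached the server"
    except requests.exceptions.Timeout:
        # Ambiguous: could be mode 2 (request lost), mode 3 (server slow or
        # crashed mid-processing), or mode 4 (response lost on the way back).
        return "timeout: request lost, server slow/crashed, or response lost?"
    except requests.exceptions.HTTPError as exc:
        # Mode 3 surfaced explicitly: the server received the message but
        # reported a processing failure (4xx/5xx).
        return f"server-side processing failure: {exc.response.status_code}"


if __name__ == "__main__":
    print(send_order({"item": "widget", "qty": 3}))
```

This ambiguity is exactly why retries need care: after a timeout, the client does not know whether the server already applied the request.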

Given this setup, what do we do next?
In an ideal world, where we could run on the assumption that both machines maintain a healthy, long-lived network relationship, we could minimize the amount of logging, tracing, and debugging. Systems would run smoothly if network communication worked 100% of the time. But that’s not the case. Networks will fail. They fail all the time. And we need to build systems that account for failing networks.
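One common way to account for failing networks is to retry failed calls with exponential backoff and jitter, so transient failures don’t surface as hard errors and retries don’t pile up into a storm. A minimal sketch, assuming the endpoint being called is idempotent and therefore safe to retry:

```python
import random
import time

import requests


def call_with_retries(url: str, payload: dict, max_attempts: int = 5) -> requests.Response:
    """Retry a request on transient network errors with exponential backoff + jitter.

    Assumes the endpoint is idempotent (safe to call more than once).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(url, json=payload, timeout=2.0)
            resp.raise_for_status()
            return resp
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) with jitter so many
            # clients don't retry in lockstep and amplify the outage.
            sleep_s = (0.1 * 2 ** (attempt - 1)) * random.uniform(0.5, 1.5)
            time.sleep(sleep_s)
```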
Moreover, in most real-world business scenarios we typically have access to only one of these machines. Sometimes this is due to legal, compliance, or organizational reasons ( e.g. different enterprise organizations or teams own their own microservices ), or due to service boundaries when communicating across B2B ( business-to-business ) or B2C ( business-to-consumer ) applications. In such cases, we access only a client-side application sending outbound requests to third-party APIs, or a server-side application processing inbound requests via REST-style API endpoints. Logging and telemetry reach their limits too; in a world with four moving components that can fail, we observe only one.
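When you control only one side, one practical mitigation is to attach a correlation ID to every outbound request and log it in a structured way, so the team that owns the other side can search their logs for the same ID. A minimal client-side sketch, assuming the provider records a custom X-Request-ID header (a common but not universal convention):

```python
import json
import logging
import time
import uuid

import requests

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("outbound")


def call_third_party(url: str, payload: dict):
    request_id = str(uuid.uuid4())  # correlation ID shared with the other side
    start = time.monotonic()
    try:
        resp = requests.post(
            url,
            json=payload,
            headers={"X-Request-ID": request_id},  # assumes the provider logs this header
            timeout=2.0,
        )
        outcome = f"http_{resp.status_code}"
        return resp
    except requests.exceptions.RequestException as exc:
        outcome = type(exc).__name__  # e.g. ConnectTimeout, ConnectionError
        return None
    finally:
        # Structured log line: everything observable from this side alone.
        log.info(json.dumps({
            "request_id": request_id,
            "url": url,
            "outcome": outcome,
            "latency_ms": round((time.monotonic() - start) * 1000, 1),
        }))
```

Even if the provider never exposes its logs, these one-sided records at least let you quantify how often, and in which mode, calls fail.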
Network Failures
Computer networks can fail for a multitude of reasons. Let’s list them out ( a rough triage sketch follows the list ):
- Data center outages
- Planned maintenance work ( e.g. an AWS service under upgrade )
- Unplanned maintenance work
- Machine failure ( from age or overload )
- Network overload ( e.g. (Distributed) Denial of Service attacks, metastable failure, faulty rate limiting )
- Natural disasters – floods, hurricanes, heat waves disrupting components
- Random, non-deterministic sources ( e.g. a single frame corrupted or dropped in transit over a Cat-5E ethernet cable, or a pulse of light lost in a fiber link ).
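Many of these causes look identical from the application layer, but a few can be separated with a low-level reachability probe. A rough triage sketch using only Python’s standard library (the host and port are placeholders):

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Coarsely classify why a remote endpoint is unreachable."""
    # Step 1: DNS resolution. Failure here points at DNS or misconfiguration,
    # not at the remote machine itself.
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        return "DNS resolution failed (config issue or DNS outage?)"

    # Step 2: TCP connect. Different errors hint at different causes.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "TCP connect succeeded (failure is likely higher up the stack)"
    except ConnectionRefusedError:
        return "connection refused (host is up, but no process listening on the port)"
    except socket.timeout:
        return "connect timed out (host down, network drop, overload, or firewall?)"
    except OSError as exc:
        return f"other network error: {exc}"


if __name__ == "__main__":
    print(probe("api.example.com", 443))  # placeholder endpoint
```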
Process Crash Failures
- Insufficient machine resources – CPU, Disk, RAM
- Unexpected data volume
- Thundering herd scenarios ( e.g. DDoS attacks, Prime Day sales, or popular celebrity search spikes ).
- Black swan events / non-deterministic causes – aging hardware, or a one-off bit flip in a CPU register
- Run-time program failures ( e.g. call stack depth exceeded, infinite loops, type mismatches between caller and callee )
- Atypical or incorrect input formats
- Processes hanging on communication with other components ( e.g. asynchronous database reads or writes ); see the timeout sketch after this list.
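For that last item, a common defensive pattern is to bound every call to an external component with an explicit timeout, so the process fails fast instead of hanging forever. A minimal sketch using asyncio, with a stand-in coroutine in place of a real database driver:

```python
import asyncio


async def slow_db_write(record: dict) -> None:
    """Stand-in for an async database write that has stalled."""
    await asyncio.sleep(60)  # simulates a hang (e.g. lock contention, dead peer)


async def save_record(record: dict) -> bool:
    try:
        # Bound the call so a hung dependency can't hang this process too.
        await asyncio.wait_for(slow_db_write(record), timeout=2.0)
        return True
    except asyncio.TimeoutError:
        # Surface the failure (log, retry, or shed load) instead of blocking forever.
        return False


if __name__ == "__main__":
    print(asyncio.run(save_record({"id": 1})))  # prints False after ~2 seconds
```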
