Learn to create a robust, easy-to-scale architecture with cells (ARC335)

Resilience Journey: Architecting for Reliability and Scale

Typical Customer Journey

Organizations experience high-impact incidents
This leads to a drive to improve overall resiliency and efficiency of the architecture
The goal is to reduce detection and restoration time, and increase MTBF (mean time between failures)

Key Concepts

Blast Radius: Reducing the impact of failures by containing them within a defined boundary
Observability: Detecting problems early and mitigating them, reducing MTTD (mean time to detect) and MTTR (mean time to repair)
Deployments: Reducing the impact of bad deployments, increasing MTBF
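
How these metrics interact can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is illustrative only: the numbers are hypothetical and not from the talk, and MTTR is assumed to cover the full time from failure to restoration, including detection time.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, for illustration only.
baseline = availability(mtbf_hours=500, mttr_hours=2)
faster_recovery = availability(mtbf_hours=500, mttr_hours=1)  # halve MTTR
fewer_failures = availability(mtbf_hours=1000, mttr_hours=2)  # double MTBF

print(f"baseline:        {baseline:.5f}")
print(f"faster recovery: {faster_recovery:.5f}")
print(f"fewer failures:  {fewer_failures:.5f}")
```

Under this formula, halving MTTR and doubling MTBF yield the same availability, which is why detection, restoration, and deployment safety are treated as complementary levers.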

Fault Isolation Boundaries

AWS Regions and Availability Zones (AZs) provide natural fault isolation boundaries
Architect workloads to reduce the impact of failures within these boundaries

Failure Modes

Gray Failures: Different observers perceive the same issue differently, for example health checks report a host as healthy while its clients see errors
Poison Pill: A client can send a request that triggers an application bug and cascades across the infrastructure
Bad Deployments: Large blast radius can make rollbacks and recovery challenging
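
One way to surface a gray failure is to compare what the health check reports with what clients actually observe. The following is a minimal sketch under stated assumptions: the `HostView` fields, the `classify` function, and the 5% error threshold are all hypothetical and chosen for illustration, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class HostView:
    host: str
    health_check_ok: bool  # what the load balancer's health check sees
    client_requests: int   # requests observed from the client side
    client_errors: int     # errors observed from the client side


def classify(view: HostView, max_error_rate: float = 0.05) -> str:
    """Label a host by comparing the server-side and client-side views."""
    if not view.health_check_ok:
        return "hard_failure"  # both views agree the host is down
    if view.client_requests == 0:
        return "unknown"       # no client-side evidence either way
    error_rate = view.client_errors / view.client_requests
    # Health check passes but clients are failing: the views disagree.
    return "gray_failure" if error_rate > max_error_rate else "healthy"


print(classify(HostView("host-a", True, 1000, 200)))
```

The key design point is that the classification needs a second, independent signal; a health check alone cannot reveal a failure it does not observe.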

Cell-based Architecture

Replicate the entire application stack within each "cell" (an isolated boundary)
Cells do not share state; data is partitioned across cells
Reduces blast radius and enables easier scaling
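
Because data is partitioned, something in front of the cells has to map each request to the cell that owns its data. A minimal sketch of such a routing layer follows, assuming a hypothetical fixed list of cells and a hash of a partition key such as a customer ID; the names `CELLS`, `cell_for`, and `blast_radius` are illustrative and not from the talk.

```python
import hashlib

# Hypothetical cell identifiers, for illustration only.
CELLS = ["cell-0", "cell-1", "cell-2", "cell-3"]


def cell_for(partition_key: str, cells=CELLS) -> str:
    """Map a partition key to one cell, deterministically."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def blast_radius(cell_count: int) -> float:
    """Fraction of keys affected if one cell fails, assuming an even spread."""
    return 1 / cell_count


print(cell_for("customer-42"))
print(f"{blast_radius(len(CELLS)):.0%} of customers affected by one cell failure")
```

Modulo hashing is the simplest choice but remaps most keys when the number of cells changes, so a real router might instead keep an explicit key-to-cell mapping table; that trade-off is outside this sketch.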

Cell Designs

Multi-AZ Cells: Each cell spans multiple AZs and relies on the native recovery of the AWS services it uses; traffic does not cross cell boundaries
Single-AZ Cells: Affinity keeps traffic within an AZ, providing more control over failures

Observability

Adding dimensions (such as cell and AZ) to logs and metrics gives them context and enables drill-downs
Cell-aware observability provides granular insights into individual cell performance
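
The value of dimensions can be seen by aggregating the same request records at different levels. This is a minimal sketch with made-up records; the field names (`cell`, `az`, `status`) and the `error_rate_by` helper are assumptions for illustration.

```python
from collections import defaultdict

# Made-up request records; each carries its cell and AZ as dimensions.
records = [
    {"cell": "cell-1", "az": "az-a", "status": 200},
    {"cell": "cell-1", "az": "az-b", "status": 200},
    {"cell": "cell-2", "az": "az-a", "status": 500},
    {"cell": "cell-2", "az": "az-b", "status": 200},
]


def error_rate_by(records, dimension):
    """Group records by one dimension and return the 5xx rate per value."""
    totals = defaultdict(lambda: [0, 0])  # value -> [requests, errors]
    for record in records:
        bucket = totals[record[dimension]]
        bucket[0] += 1
        if record["status"] >= 500:
            bucket[1] += 1
    return {value: errs / reqs for value, (reqs, errs) in totals.items()}


overall = sum(r["status"] >= 500 for r in records) / len(records)
print("overall:", overall)
print("by cell:", error_rate_by(records, "cell"))
print("by az:  ", error_rate_by(records, "az"))
```

The aggregate error rate hides which cell is unhealthy; grouping by the cell dimension isolates it, and grouping by AZ gives a second axis to drill into.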

Deployments

The unit of deployment is the entire cell
Staggered deployments to individual cells minimize impact
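
A staggered rollout can be sketched as a series of waves that grow in size, with a health check after each wave and a rollback if any cell in the wave is unhealthy. The wave sizes, function names, and the `deploy`, `is_healthy`, and `rollback` callbacks below are hypothetical placeholders, not the process described in the talk.

```python
from typing import Callable, List


def plan_waves(cells: List[str], first_wave: int = 1, growth: int = 2) -> List[List[str]]:
    """Split cells into waves: 1 cell, then 2, then 4, and so on."""
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be at least 1")
    waves, start, size = [], 0, first_wave
    while start < len(cells):
        waves.append(cells[start:start + size])
        start += size
        size *= growth
    return waves


def rollout(
    cells: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy wave by wave; on an unhealthy wave, roll back every deployed cell."""
    deployed: List[str] = []
    for wave in plan_waves(cells):
        for cell in wave:
            deploy(cell)
            deployed.append(cell)
        if not all(is_healthy(cell) for cell in wave):
            for cell in reversed(deployed):
                rollback(cell)
            return False
    return True


print(plan_waves([f"cell-{i}" for i in range(1, 8)]))
```

Starting with a single cell means a bad release is caught while it affects the smallest possible share of traffic; later waves can grow because confidence in the release has increased.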

Slack's Journey

Evolved from a single egress stack per VPC to a cell-based architecture
Key goals were reliability, reduced blast radius, and cost control
Detailed monitoring and alerting at both cluster and cell levels
Challenges around alert noise and deployment complexity

Key Takeaways

Cell-based architecture should solve both technical and business challenges
It may not be a perfect fit for all workloads; evaluate it on a per-workload basis
Focus on reducing blast radius, improving observability, and safe deployments
