Learn to create a robust, easy-to-scale architecture with cells (ARC335)

Resilience Journey: Architecting for Reliability and Scale

Typical Customer Journey

  • Organizations experience high-impact incidents
  • This leads to a drive to improve overall resiliency and efficiency of the architecture
  • The goal is to shorten detection and restoration times and to increase MTBF (mean time between failures)

Key Concepts

  1. Blast Radius: Reducing the impact of failures by containing them within a defined boundary
  2. Observability: Detecting problems early and mitigating them, reducing MTTD (mean time to detect) and MTTR (mean time to repair)
  3. Deployments: Reducing the impact of bad deployments, increasing MTBF
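
As a rough illustration of why both repair time and MTBF matter, the standard steady-state availability formula, MTBF / (MTBF + MTTR), can be evaluated for a few cases. This is a minimal sketch: the hour values are hypothetical, and MTTR is assumed here to include detection time.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers, chosen only for illustration.
baseline = availability(mtbf_hours=500, mttr_hours=4)
faster_repair = availability(mtbf_hours=500, mttr_hours=1)    # lower MTTD/MTTR
fewer_failures = availability(mtbf_hours=2000, mttr_hours=4)  # higher MTBF

for name, value in [
    ("baseline", baseline),
    ("faster repair", faster_repair),
    ("fewer failures", fewer_failures),
]:
    print(f"{name}: {value:.4%}")
```

Either lever raises availability, which is why the talk treats observability (faster detection and repair) and safer deployments (fewer failures) as complementary.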

Fault Isolation Boundaries

  • AWS Regions and Availability Zones (AZs) provide natural fault isolation boundaries
  • Architect workloads to reduce the impact of failures within these boundaries

Failure Modes

  1. Gray Failures: Partial failures that different observers perceive differently, for example health checks pass while some clients see errors or high latency
  2. Poison Pill: A client can send a request that triggers an application bug and cascades across the infrastructure
  3. Bad Deployments: A faulty release with a large blast radius makes rollback and recovery difficult
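
A gray failure can be pictured as a disagreement between two viewpoints on the same service. The sketch below is only an illustration: the function name, its inputs, and the 5% threshold are assumptions, not something taken from the talk.

```python
def is_gray_failure(
    health_check_ok: bool,
    client_error_rate: float,
    threshold: float = 0.05,  # hypothetical cutoff for "clients are suffering"
) -> bool:
    """Flag the case where the system looks healthy from the inside
    while clients observe an elevated error rate."""
    return health_check_ok and client_error_rate > threshold


# Health checks pass, yet 20% of client requests fail: a gray failure.
print(is_gray_failure(health_check_ok=True, client_error_rate=0.20))
# Health checks fail too: an ordinary, visible failure, not a gray one.
print(is_gray_failure(health_check_ok=False, client_error_rate=0.20))
```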

Cell-based Architecture

  • Copy the entire application stack within each "cell" (isolated boundary)
  • Cell state is not shared; data is partitioned across cells
  • Reduces blast radius and enables easier scaling
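
One way to picture the partitioning is a thin routing layer that maps each partition key to exactly one cell. The sketch below is an assumption-laden illustration: the cell names, the use of a customer ID as the partition key, and the hash-modulo scheme are all hypothetical choices, not the design presented in the talk.

```python
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]  # hypothetical cell names


def cell_for(customer_id: str, cells=CELLS) -> str:
    """Deterministically map a partition key to a single cell."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


# The same customer always lands in the same cell, so a failure
# (or a poison-pill request) in that cell stays inside it.
print(cell_for("customer-42"))
print(cell_for("customer-42") == cell_for("customer-42"))
```

With four cells, a failure confined to one cell affects roughly a quarter of customers instead of all of them. Note that plain hash-modulo routing remaps many keys when the number of cells changes; an explicit key-to-cell mapping table is a common alternative when cells are added over time.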

Cell Designs

  1. Multi-AZ Cells: Traffic does not cross AZ boundaries, leveraging native recovery services
  2. Single-AZ Cells: Affinity keeps traffic within an AZ, providing more control over failures

Observability

  • Dimensionality adds context to logs and metrics, enabling drill-downs
  • Cell-aware observability provides granular insights into individual cell performance
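
The value of dimensions can be shown with a small drill-down over request records. This is a minimal sketch with made-up data: the field names ("cell", "az", "status") and the sample records are assumptions for illustration.

```python
from collections import defaultdict

# Each record carries dimensions (cell, az) alongside the measurement.
records = [
    {"cell": "cell-1", "az": "az-a", "status": 200},
    {"cell": "cell-1", "az": "az-b", "status": 200},
    {"cell": "cell-2", "az": "az-a", "status": 500},
    {"cell": "cell-2", "az": "az-b", "status": 503},
    {"cell": "cell-3", "az": "az-a", "status": 200},
    {"cell": "cell-3", "az": "az-b", "status": 200},
]


def error_rate_by(records, dimension):
    """Group records by one dimension and return the 5xx rate per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        if record["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


print(error_rate_by(records, "az"))    # both AZs look equally degraded
print(error_rate_by(records, "cell"))  # the problem is isolated to cell-2
```

Grouped by AZ, the errors look like a mild, evenly spread degradation; grouped by cell, the same data shows a single cell failing completely.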

Deployments

  • The unit of deployment is the entire cell
  • Staggered deployments to individual cells minimize impact
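
A staggered rollout can be sketched as a loop over waves of cells that halts as soon as a wave reports an unhealthy cell. The wave layout and the `deploy` and `healthy` callbacks below are hypothetical stand-ins for real deployment tooling and health signals.

```python
from typing import Callable, Dict, List


def staggered_deploy(
    waves: List[List[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Deploy wave by wave; stop before the next wave if any cell is unhealthy."""
    completed: List[str] = []
    for wave in waves:
        for cell in wave:
            deploy(cell)
            completed.append(cell)
        failed = [cell for cell in wave if not healthy(cell)]
        if failed:
            return {"completed": completed, "halted_on": failed}
    return {"completed": completed, "halted_on": []}


# Hypothetical rollout: one canary cell first, then progressively larger waves.
waves = [["cell-1"], ["cell-2", "cell-3"], ["cell-4", "cell-5", "cell-6"]]
result = staggered_deploy(
    waves,
    deploy=lambda cell: print(f"deploying to {cell}"),
    healthy=lambda cell: cell != "cell-3",  # simulate a bad release in cell-3
)
print(result)
```

In this simulated run the rollout stops after the second wave, so the last three cells never receive the bad release.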

Slack's Journey

  • Evolved from a single egress stack per VPC to a cell-based architecture
  • Key goals were reliability, reduced blast radius, and cost control
  • Detailed monitoring and alerting at both cluster and cell levels
  • Challenges around alert noise and deployment complexity

Key Takeaways

  1. Cell-based architecture should solve both technical and business challenges
  2. It may not be a perfect fit for all workloads; evaluate it on a per-workload basis
  3. Focus on reducing blast radius, improving observability, and safe deployments
