Learn to create a robust, easy-to-scale architecture with cells (ARC335)
Resilience Journey: Architecting for Reliability and Scale
Typical Customer Journey
Organizations experience high-impact incidents
These incidents drive an effort to improve the overall resilience and efficiency of the architecture
The goal is to reduce the time to detect and restore, and to increase MTBF (mean time between failures)
Key Concepts
Blast Radius: Reducing the impact of failures by containing them within a defined boundary
Observability: Detecting problems early and mitigating them, reducing MTTD (mean time to detect) and MTTR (mean time to repair)
Deployments: Reducing the impact of bad deployments, increasing MTBF
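To make the three metrics above concrete, here is a minimal sketch of how they could be computed from incident records. The Incident record, its field names, and the choice to measure MTTR from detection to restoration are assumptions for illustration, not definitions given in the talk; teams often measure these intervals differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; times are hours since the start of the window."""
    started: float
    detected: float
    restored: float


def mttd(incidents: Sequence[Incident]) -> float:
    """Mean time to detect: failure start until detection."""
    return mean(i.detected - i.started for i in incidents)


def mttr(incidents: Sequence[Incident]) -> float:
    """Mean time to repair, measured here (an assumption) from detection to restoration."""
    return mean(i.restored - i.detected for i in incidents)


def mtbf(incidents: Sequence[Incident]) -> float:
    """Mean time between failures: healthy time between one restoration and the next failure."""
    if len(incidents) < 2:
        raise ValueError("MTBF needs at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.started)
    return mean(nxt.started - prev.restored for prev, nxt in zip(ordered, ordered[1:]))
```

Shortening detection and restoration lowers the first two numbers, while containing bad deployments lengthens the healthy gaps that MTBF averages.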
Fault Isolation Boundaries
AWS Regions and Availability Zones (AZs) provide natural fault isolation boundaries
Architect workloads to reduce the impact of failures within these boundaries
Failure Modes
Gray Failures: Different observers disagree about the same issue, for example the system's own health checks pass while clients see errors
Poison Pill: A client request triggers an application bug, and the resulting failure cascades across the infrastructure
Bad Deployments: A large blast radius makes rollback and recovery challenging
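A gray failure can be illustrated by comparing two views of the same host. The sketch below flags hosts whose health check passes while the client-observed error rate is high. The HostView record, its fields, and the 5% threshold are hypothetical choices made for this example only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HostView:
    """Hypothetical pairing of the system's view and the clients' view of one host."""
    host: str
    health_check_passing: bool  # what the system itself reports
    client_requests: int        # requests observed by clients
    client_errors: int          # failed requests observed by clients


def gray_failure_suspects(views: Iterable[HostView], error_threshold: float = 0.05) -> List[str]:
    """Return hosts that look healthy to the system but unhealthy to clients."""
    suspects = []
    for view in views:
        if view.client_requests == 0:
            continue  # no client traffic, so there is no client perspective to compare
        error_rate = view.client_errors / view.client_requests
        if view.health_check_passing and error_rate > error_threshold:
            suspects.append(view.host)
    return suspects
```

A host that fails its health check is an ordinary, detectable failure and is deliberately not reported here; only the disagreement between the two views is.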
Cell-based Architecture
Replicate the entire application stack in each "cell", which acts as an isolated boundary
Cell state is not shared; data is partitioned across cells
Reduces blast radius and enables easier scaling
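One way to picture the partitioning is a thin routing layer that maps a partition key to exactly one cell. The sketch below uses a stable hash and a modulo, which is an assumption for illustration; the cell names and customer IDs are hypothetical, and a real router might instead keep an explicit mapping table so that adding cells does not remap existing keys.

```python
import hashlib
from typing import Sequence

CELLS = ("cell-1", "cell-2", "cell-3", "cell-4")  # hypothetical cell names


def cell_for(partition_key: str, cells: Sequence[str] = CELLS) -> str:
    """Map a partition key (for example a customer ID) to one cell, deterministically."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def impacted_fraction(failed_cell: str, keys: Sequence[str], cells: Sequence[str] = CELLS) -> float:
    """Fraction of keys served by the failed cell, a rough proxy for blast radius."""
    hit = sum(1 for key in keys if cell_for(key, cells) == failed_cell)
    return hit / len(keys)
```

Because every key lands in a single cell and cells share no state, a failure in one cell affects only the keys routed to it, roughly one share in len(cells) when keys are spread evenly.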
Cell Designs
Multi-AZ Cells: Each cell spans multiple AZs and relies on the native multi-AZ recovery of the services it uses
Single-AZ Cells: Affinity keeps traffic within an AZ, providing more control over failures
Observability
Dimensionality adds context to logs and metrics, enabling drill-downs
Cell-aware observability provides granular insights into individual cell performance
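The idea of dimensions can be sketched with a tiny in-memory metric store: each data point carries key/value dimensions such as the cell and AZ, and the same metric can then be broken down by any of them. MetricStore and its methods are hypothetical names for this example, not the API of any monitoring service.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class MetricStore:
    """Hypothetical in-memory store of metric data points with dimensions."""

    def __init__(self) -> None:
        self._points: List[Tuple[str, float, Dict[str, str]]] = []

    def emit(self, name: str, value: float, **dimensions: str) -> None:
        """Record one data point together with its dimensions (e.g. cell, az)."""
        self._points.append((name, value, dict(dimensions)))

    def sum_by(self, name: str, dimension: str) -> Dict[str, float]:
        """Drill down: total of a metric grouped by one dimension."""
        totals: Dict[str, float] = defaultdict(float)
        for point_name, value, dims in self._points:
            if point_name == name and dimension in dims:
                totals[dims[dimension]] += value
        return dict(totals)


store = MetricStore()
store.emit("errors", 1, cell="cell-1", az="a")
store.emit("errors", 2, cell="cell-1", az="b")
store.emit("errors", 40, cell="cell-2", az="a")
```

With these sample points, grouping by cell shows that cell-2 accounts for 40 of the 43 errors, a signal that an aggregate error count alone would hide.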
Deployments
The unit of deployment is the entire cell
Staggered deployments to individual cells minimize impact
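A staggered rollout can be sketched as a loop over waves of cells that stops at the first unhealthy wave. The function below is an illustration under assumptions: the deploy, is_healthy, and rollback callables are hypothetical hooks supplied by the caller, and rolling back every cell touched so far is one possible policy, not necessarily the one described in the talk.

```python
from typing import Callable, List, Sequence, Tuple


def staggered_deploy(
    waves: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Tuple[bool, List[str]]:
    """Deploy wave by wave; on an unhealthy wave, roll back and stop.

    Returns (succeeded, cells that received the deployment).
    """
    deployed: List[str] = []
    for wave in waves:
        for cell in wave:
            deploy(cell)
            deployed.append(cell)
        # Check the whole wave before moving on, so later waves are never touched
        # by a release that has already shown problems.
        if not all(is_healthy(cell) for cell in wave):
            for cell in reversed(deployed):
                rollback(cell)
            return False, deployed
    return True, deployed
```

Starting with a small first wave, such as a single cell, keeps the impact of a bad release limited to that cell while later waves grow in size.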
Slack's Journey
Evolved from a single egress stack per VPC to a cell-based architecture
Key goals were reliability, reduced blast radius, and cost control
Detailed monitoring and alerting at both cluster and cell levels
Challenges included alert noise and deployment complexity
Key Takeaways
Cell-based architecture should solve both technical and business challenges
It may not be a perfect fit for every workload; evaluate it on a per-workload basis
Focus on reducing blast radius, improving observability, and safe deployments