-
Netflix is responsible for about 1/3 of downstream internet traffic in North America
-
The responsible team at Netflix is called "Edge Engineering"
- Apart from resiliency/scaling, it also cares about high-velocity product innovation and real-time health insights
-
Basic architecture:
- end-user devices make requests to ELBs, which delegate to Zuul, which routes them to the origin servers serving the APIs
-
Zuul
- Multi-region resiliency
- Cross-region failover: if us-east has a failure it routes to us-west
- This also comes with a DNS change, so after propagation consumers are simply directed to another region
- Dynamic routing
- Route some users to a debug version of the API
- Route a % of the traffic to a newer version (see the filter sketch after this list)
- Managed via a web interface
- Security and authentication!
- Squeeze test
- For performance testing, finding breaking points, etc
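A minimal sketch of how this kind of percentage-based dynamic routing could look as a Zuul 1.x pre-filter; the canary hostname, port, and 1% threshold are made-up placeholders, not Netflix's actual configuration:

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

import java.net.URL;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical pre-filter: sends a small slice of traffic to a newer API cluster.
public class CanaryRoutingFilter extends ZuulFilter {

    @Override
    public String filterType() {
        return "pre";   // runs before the request is routed to the origin
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        // Route ~1% of requests to the canary cluster (placeholder percentage).
        return ThreadLocalRandom.current().nextInt(100) < 1;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        try {
            // Placeholder hostname; in practice this would come from dynamic configuration.
            ctx.setRouteHost(new URL("http://api-canary.example.internal:7001"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return null;
    }
}
```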
-
Numbers:
- Request ratio is 7:1 (each incoming request results in 7 internal service calls)
- 5 billion requests/day
- 30 dependent services
- 0 of them have a 100% SLA
- If they each have 99.99% uptime, together the system would only have ~99.7% (0.9999^30 ≈ 0.997; see the calculation below)
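To make the compounding concrete, a quick back-of-the-envelope calculation using the figures above (99.99% per dependency, 30 dependencies, 5 billion requests/day):

```java
public class CompositeAvailability {
    public static void main(String[] args) {
        double perServiceUptime = 0.9999;  // 99.99% per dependency
        int dependencies = 30;

        // Availability compounds multiplicatively across serial dependencies.
        double composite = Math.pow(perServiceUptime, dependencies);
        System.out.printf("Composite uptime: %.2f%%%n", composite * 100);   // ~99.70%

        // At 5 billion requests/day, even ~0.3% unavailability is a lot of requests.
        long requestsPerDay = 5_000_000_000L;
        System.out.printf("Potentially affected requests/day: ~%,.0f%n",
                requestsPerDay * (1 - composite));                          // ~15 million
    }
}
```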
-
Hystrix
- Toggle circuit breakers
- They do percent-based breakers
- Sample fallback: when the custom ratings service is down they just show the average ratings (see the sketch after this list)
- Health of all dependencies
- Really nice condensed view of the health of the system:
- Error rate
- Queue status
- Response times
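A minimal Hystrix sketch of the ratings fallback described above; the command name, the stubbed remote call, and the 4.2 fallback value are hypothetical stand-ins, not Netflix's actual code:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical command: fetch a member's personalized rating for a title,
// falling back to the title's average rating when the dependency is unhealthy.
public class GetPersonalizedRatingCommand extends HystrixCommand<Double> {

    private final long titleId;

    public GetPersonalizedRatingCommand(long titleId) {
        super(HystrixCommandGroupKey.Factory.asKey("RatingsService"));
        this.titleId = titleId;
    }

    @Override
    protected Double run() throws Exception {
        // Placeholder for the real remote call to the personalized-ratings service.
        return callPersonalizedRatingsService(titleId);
    }

    @Override
    protected Double getFallback() {
        // Circuit open, timeout, or error: degrade to a precomputed average rating.
        return 4.2; // placeholder; the real system would look up the title's average
    }

    private Double callPersonalizedRatingsService(long titleId) throws Exception {
        throw new UnsupportedOperationException("remote call stubbed out in this sketch");
    }

    public static void main(String[] args) {
        // With the remote call stubbed to fail, Hystrix invokes the fallback
        // instead of propagating the error, so this prints 4.2.
        System.out.println(new GetPersonalizedRatingCommand(42L).execute());
    }
}
```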
-
Grepzilla
- Realtime, distributed tail+grep
- Seems to cover some of Splunk's use cases, from the command line
-
Spinnaker
- New tool Netflix is developing to manage AWS resources
-
"The possibilities are numerous once we decide to act and not react"
-
Reactive auto scaling
- React to real-time conditions (see the sketch after this list)
- Respond to spikes/dips in metrics
- eg: load averages, req/sec, etc
- Excellent for many scenarios
- But comes with challenges:
- Policies can be inefficient when traffic patterns vary
- Performance degradation during instance startup
- Outages can trigger scale down events
- Excess capacity
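The notes don't include the actual scaling policies, but a toy version of the reactive idea (threshold rules evaluated against a live metric such as requests/sec per instance) could look like the following; thresholds, step sizes, and limits are invented for illustration:

```java
// Toy reactive scaling rule: step capacity up or down based on a live metric.
public class ReactiveScaler {

    static final double SCALE_UP_THRESHOLD = 800.0;   // req/sec per instance (invented)
    static final double SCALE_DOWN_THRESHOLD = 300.0;
    static final int MIN_INSTANCES = 10;
    static final int MAX_INSTANCES = 500;

    // Returns the desired instance count given current capacity and observed load.
    static int desiredCapacity(int currentInstances, double reqPerSecPerInstance) {
        if (reqPerSecPerInstance > SCALE_UP_THRESHOLD) {
            // Scale up aggressively: new instances take minutes to come into service.
            return Math.min(MAX_INSTANCES, currentInstances + Math.max(1, currentInstances / 10));
        }
        if (reqPerSecPerInstance < SCALE_DOWN_THRESHOLD) {
            // Scale down conservatively: an outage-driven dip in traffic can look like
            // low load and trigger an unwanted scale-down (the problem noted above).
            return Math.max(MIN_INSTANCES, currentInstances - 1);
        }
        return currentInstances;
    }

    public static void main(String[] args) {
        System.out.println(desiredCapacity(100, 950.0)); // 110: spike -> scale up
        System.out.println(desiredCapacity(100, 150.0)); // 99: dip -> slow scale down
    }
}
```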
-
Scryer: predictive auto scaling system
- Evaluates needs based on historical data
- week over week, month over month
- Adjusts minimums (only set the minimum capacity you expect at any given time, and let the reactive auto scaler handle anything above it; see the sketch after this list)
- Good results in production
- During outages the drop in requests is normally followed by a spike of pent-up requests once service resumes. By setting the minimum, Scryer made sure the system was ready to handle the load when it came back.
- Saved money
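A sketch of the "predict the minimum, let the reactive scaler handle the rest" idea; the week-over-week averaging, capacity model, and safety margin are assumptions for illustration, not Scryer's actual algorithm:

```java
import java.util.List;

// Sketch of predictive minimum-setting: derive the capacity floor from historical
// traffic for the same hour-of-week, then let reactive scaling push above it.
public class PredictiveMinimum {

    // Instances needed for a given request rate (made-up capacity model).
    static int instancesFor(double reqPerSec) {
        double reqPerSecPerInstance = 600.0; // assumed per-instance capacity
        return (int) Math.ceil(reqPerSec / reqPerSecPerInstance);
    }

    // Predicted minimum for an hour slot, from the same slot in previous weeks.
    static int predictedMinInstances(List<Double> sameHourPreviousWeeksReqPerSec) {
        double avg = sameHourPreviousWeeksReqPerSec.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        double safetyMargin = 1.2; // assumed headroom so the floor isn't too tight
        return instancesFor(avg * safetyMargin);
    }

    public static void main(String[] args) {
        // Peak-hour traffic for the last four weeks (made-up numbers, req/sec).
        List<Double> history = List.of(52_000.0, 54_000.0, 51_000.0, 55_000.0);
        int minInstances = predictedMinInstances(history);
        System.out.println("Set auto-scaling group minimum to " + minInstances + " instances");
        // The reactive scaler handles anything above this floor, and after an outage
        // the floor keeps enough capacity ready for the pent-up demand.
    }
}
```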
-
Other takeaways:
- Timeout and retry configuration requires a lot of attention (see the sketch below)
- Fallbacks for when circuit breakers are open are important (e.g. how they show average title ratings instead of the personalized ones)
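The notes don't say what the right values are, but the shape of the timeout/retry concern can be sketched generically: a hard per-attempt deadline plus a bounded number of retries with backoff. All numbers below are placeholders:

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generic sketch: every remote call gets a hard per-attempt timeout, and retries
// are bounded with backoff so a slow or failing dependency isn't amplified by callers.
public class BoundedRetry {

    static <T> T callWithRetries(Callable<T> call, int maxAttempts, Duration perAttemptTimeout)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            try {
                // Hard deadline per attempt; a hung call doesn't block the caller forever.
                return pool.submit(call).get(perAttemptTimeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                last = e;
                // Only retry operations that are safe to repeat; back off between attempts.
                Thread.sleep(50L * attempt);
            } finally {
                pool.shutdownNow();
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Example: a call that always times out eventually exhausts its retries.
        try {
            callWithRetries(() -> { Thread.sleep(5_000); return "ok"; },
                    3, Duration.ofMillis(200));
        } catch (Exception e) {
            System.out.println("Gave up after bounded retries: " + e);
        }
    }
}
```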
-
Maintaining a resilient front door at massive scale, from Netflix