-
Netflix is responsible for about 1/3 of downstream internet traffic in North America
-
The responsible team at Netflix is called "Edge Engineering"
- Apart from resiliency/scaling, it also cares about high-velocity product innovation and real-time health insights
-
Basic architecture:
- end-user devices make requests to ELBs, which delegate to Zuul, which routes them to the origin servers serving the APIs
-
Zuul
- Multi-region resiliency
- Cross-region failover: if us-east has a failure it routes to us-west
- This also comes with a DNS change, so after propagation consumers are simply directed to another region
- Dynamic routing
- Route some users to a debug version of the API
- Route a % of the traffic to a newer version (see the filter sketch after this list)
- Managed via a web interface
- Security and authentication!
- Squeeze test
- For performance testing, finding breaking points, etc
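A minimal sketch of how this kind of percentage-based dynamic routing could look as a Zuul 1.x pre-filter; the canary hostname, port, and 1% threshold are made-up placeholders, not Netflix's actual configuration:

```java
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

import java.net.URL;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical pre-filter: sends a small slice of traffic to a newer API cluster.
public class CanaryRoutingFilter extends ZuulFilter {

    @Override
    public String filterType() {
        return "pre";   // runs before the request is routed to the origin
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        // Route ~1% of requests to the canary cluster (placeholder percentage).
        return ThreadLocalRandom.current().nextInt(100) < 1;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        try {
            // Placeholder hostname; in practice this would come from dynamic configuration.
            ctx.setRouteHost(new URL("http://api-canary.example.internal:7001"));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return null;
    }
}
```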
-
Numbers:
- Request ratio is 7:1 (each incoming request results in 7 internal service calls)
- 5 billion requests/day
- 30 dependent services
- 0 of them have a 100% SLA
- If they each have 99.99% uptime, together the system would only have ~99.7% (0.9999^30 ≈ 0.997; see the calculation below)
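To make the compounding concrete, a quick back-of-the-envelope calculation using the figures above (99.99% per dependency, 30 dependencies, 5 billion requests/day):

```java
public class CompositeAvailability {
    public static void main(String[] args) {
        double perServiceUptime = 0.9999;  // 99.99% per dependency
        int dependencies = 30;

        // Availability compounds multiplicatively across serial dependencies.
        double composite = Math.pow(perServiceUptime, dependencies);
        System.out.printf("Composite uptime: %.2f%%%n", composite * 100);   // ~99.70%

        // At 5 billion requests/day, even ~0.3% unavailability is a lot of requests.
        long requestsPerDay = 5_000_000_000L;
        System.out.printf("Potentially affected requests/day: ~%,.0f%n",
                requestsPerDay * (1 - composite));                          // ~15 million
    }
}
```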
-
Hystrix
- Toggle circuit breakers
- They do percent-based breakers
- Sample fallback: when the custom ratings service is down they just show the average ratings (see the sketch after this list)
- Health of all dependencies
- Really nice condensed view of the health of the system:
- Error rate
- Queue status
- Response times
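A minimal Hystrix sketch of the ratings fallback described above; the command name, the stubbed remote call, and the 4.2 fallback value are hypothetical stand-ins, not Netflix's actual code:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Hypothetical command: fetch a member's personalized rating for a title,
// falling back to the title's average rating when the dependency is unhealthy.
public class GetPersonalizedRatingCommand extends HystrixCommand<Double> {

    private final long titleId;

    public GetPersonalizedRatingCommand(long titleId) {
        super(HystrixCommandGroupKey.Factory.asKey("RatingsService"));
        this.titleId = titleId;
    }

    @Override
    protected Double run() throws Exception {
        // Placeholder for the real remote call to the personalized-ratings service.
        return callPersonalizedRatingsService(titleId);
    }

    @Override
    protected Double getFallback() {
        // Circuit open, timeout, or error: degrade to a precomputed average rating.
        return 4.2; // placeholder; the real system would look up the title's average
    }

    private Double callPersonalizedRatingsService(long titleId) throws Exception {
        throw new UnsupportedOperationException("remote call stubbed out in this sketch");
    }

    public static void main(String[] args) {
        // With the remote call stubbed to fail, Hystrix invokes the fallback
        // instead of propagating the error, so this prints 4.2.
        System.out.println(new GetPersonalizedRatingCommand(42L).execute());
    }
}
```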
-
Grepzilla
- Realtime, distributed tail+grep
- Seems to cover some of Splunk's use cases, from the command line
-
Spinnaker
- New tool Netflix is developing to manage AWS resources
-
"The possibilities are numerous once we decide to act and not react"
-
Reactive auto scaling
- React to real-time conditions (see the sketch after this list)
- Respond to spikes/dips in metrics
- eg: load averages, req/sec, etc
- Excellent for many scenarios
- But comes with challenges:
- Policies can be inefficient when traffic patterns vary
- Performance degradation during instance startup
- Outages can trigger scale down events
- Excess capacity
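The notes don't include the actual scaling policies, but a toy version of the reactive idea (threshold rules evaluated against a live metric such as requests/sec per instance) could look like the following; thresholds, step sizes, and limits are invented for illustration:

```java
// Toy reactive scaling rule: step capacity up or down based on a live metric.
public class ReactiveScaler {

    static final double SCALE_UP_THRESHOLD = 800.0;   // req/sec per instance (invented)
    static final double SCALE_DOWN_THRESHOLD = 300.0;
    static final int MIN_INSTANCES = 10;
    static final int MAX_INSTANCES = 500;

    // Returns the desired instance count given current capacity and observed load.
    static int desiredCapacity(int currentInstances, double reqPerSecPerInstance) {
        if (reqPerSecPerInstance > SCALE_UP_THRESHOLD) {
            // Scale up aggressively: new instances take minutes to come into service.
            return Math.min(MAX_INSTANCES, currentInstances + Math.max(1, currentInstances / 10));
        }
        if (reqPerSecPerInstance < SCALE_DOWN_THRESHOLD) {
            // Scale down conservatively: an outage-driven dip in traffic can look like
            // low load and trigger an unwanted scale-down (the problem noted above).
            return Math.max(MIN_INSTANCES, currentInstances - 1);
        }
        return currentInstances;
    }

    public static void main(String[] args) {
        System.out.println(desiredCapacity(100, 950.0)); // 110: spike -> scale up
        System.out.println(desiredCapacity(100, 150.0)); // 99: dip -> slow scale down
    }
}
```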
-
Scryer: predictive auto scaling system
- Evaluates needs based on historical data
- week over week, month over month
- Adjusts minimums (only set the minimum capacity you expect at any given time, and let the reactive auto scaler handle anything above it; see the sketch after this list)
- Good results in production
- During outages the drop in requests is normally followed by a spike of pent-up requests once service resumes. By setting the minimum, Scryer made sure the system was ready to handle the load when it came back.
- Saved money
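A sketch of the "predict the minimum, let the reactive scaler handle the rest" idea; the week-over-week averaging, capacity model, and safety margin are assumptions for illustration, not Scryer's actual algorithm:

```java
import java.util.List;

// Sketch of predictive minimum-setting: derive the capacity floor from historical
// traffic for the same hour-of-week, then let reactive scaling push above it.
public class PredictiveMinimum {

    // Instances needed for a given request rate (made-up capacity model).
    static int instancesFor(double reqPerSec) {
        double reqPerSecPerInstance = 600.0; // assumed per-instance capacity
        return (int) Math.ceil(reqPerSec / reqPerSecPerInstance);
    }

    // Predicted minimum for an hour slot, from the same slot in previous weeks.
    static int predictedMinInstances(List<Double> sameHourPreviousWeeksReqPerSec) {
        double avg = sameHourPreviousWeeksReqPerSec.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        double safetyMargin = 1.2; // assumed headroom so the floor isn't too tight
        return instancesFor(avg * safetyMargin);
    }

    public static void main(String[] args) {
        // Peak-hour traffic for the last four weeks (made-up numbers, req/sec).
        List<Double> history = List.of(52_000.0, 54_000.0, 51_000.0, 55_000.0);
        int minInstances = predictedMinInstances(history);
        System.out.println("Set auto-scaling group minimum to " + minInstances + " instances");
        // The reactive scaler handles anything above this floor, and after an outage
        // the floor keeps enough capacity ready for the pent-up demand.
    }
}
```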
-
Other takeaways:
- Timeout and retry configuration requires a lot of attention (see the sketch below)
- Fallbacks for when circuit breakers are open are important (e.g. how they show average title ratings instead of the personalized ones)
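The notes don't say what the right values are, but the shape of the timeout/retry concern can be sketched generically: a hard per-attempt deadline plus a bounded number of retries with backoff. All numbers below are placeholders:

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Generic sketch: every remote call gets a hard per-attempt timeout, and retries
// are bounded with backoff so a slow or failing dependency isn't amplified by callers.
public class BoundedRetry {

    static <T> T callWithRetries(Callable<T> call, int maxAttempts, Duration perAttemptTimeout)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            ExecutorService pool = Executors.newSingleThreadExecutor();
            try {
                // Hard deadline per attempt; a hung call doesn't block the caller forever.
                return pool.submit(call).get(perAttemptTimeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                last = e;
                // Only retry operations that are safe to repeat; back off between attempts.
                Thread.sleep(50L * attempt);
            } finally {
                pool.shutdownNow();
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Example: a call that always times out eventually exhausts its retries.
        try {
            callWithRetries(() -> { Thread.sleep(5_000); return "ok"; },
                    3, Duration.ofMillis(200));
        } catch (Exception e) {
            System.out.println("Gave up after bounded retries: " + e);
        }
    }
}
```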
-
Maintaining a resilient front door at massive scale, from Netflix