A translation of the README of https://github.com/Netflix/concurrency-limits
A Java library that implements and integrates concepts from TCP congestion control to auto-detect concurrency limits for services, in order to achieve optimal throughput with optimal latency.
When thinking about service availability, operators traditionally think in terms of RPS (requests per second). Load tests are usually run to determine the RPS at which the service falls over (*). An RPS limit is then set somewhere below this breaking point (say 75% of it) and enforced with a token bucket. In large, auto-scaling distributed systems, however, this value quickly becomes stale, and the service fails anyway because it cannot shed the excess load and becomes unresponsive.
(*) The point at which latency starts to degrade because the number of concurrent requests is too high.
Instead of thinking in terms of RPS, we should think in terms of concurrent requests, applying queuing theory to determine the number of concurrent requests a service can handle before a queue builds up, latency rises, and the service eventually exhausts a hard limit.
This relationship is nicely captured by Little's Law: Limit = average RPS * average latency.
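For example, a service that averages 1,000 RPS with an average latency of 0.05 seconds has roughly 1,000 * 0.05 = 50 requests in flight at any moment, so its natural concurrency limit is about 50.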
Enforcing a concurrency limit is very simple, but determining that limit is hard, because it requires the operator to fully understand the hardware the service runs on and to retune the limit as the service scales.
Instead, we prefer to measure or estimate the concurrency limit at each point in the network.
As the system scales and hits a limit, each node adjusts and enforces its own per-node limit. To estimate the limit, we borrow from common TCP congestion control algorithms by equating a system's concurrency limit with a TCP congestion window.
Before applying the algorithm we need to set some ground rules.
- We accept that every system has an inherent concurrency limit that is determined by hard resources, such as the number of CPU cores.
- We accept that this limit can change as a system auto-scales.
- For large and complex distributed systems it's impossible to know all the hard resources.
- We can use latency measurements to determine when queuing happens.
- We can use timeouts and rejected requests to aggressively back off.
VegasLimit is a delay-based algorithm where the bottleneck queue is estimated as

queue = L * (1 - minRTT / sampleRTT)

At the end of each sampling window the limit is increased by 1 if the estimated queue is less than alpha (typically a value between 2 and 3) or decreased by 1 if the queue is greater than beta (typically a value between 4 and 6 requests).
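As a rough sketch of this per-window update rule (not the library's actual VegasLimit implementation, which is more sophisticated), it can be pictured like this:

// Illustrative sketch only: estimate the queue from the latency ratio and
// nudge the limit up or down once per sampling window.
static int updateVegasLimit(int limit, long minRttNanos, long sampleRttNanos, int alpha, int beta) {
    double estimatedQueue = limit * (1.0 - (double) minRttNanos / sampleRttNanos);
    if (estimatedQueue < alpha) {
        return limit + 1;   // little queueing observed: probe for more concurrency
    } else if (estimatedQueue > beta) {
        return limit - 1;   // queue is building up: back off
    }
    return limit;           // within the target band: keep the current limit
}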
The Gradient2Limit algorithm attempts to address bias and drift when using minimum latency measurements. To do this the algorithm tracks the divergence between two exponential averages over a long and a short time window. By using averages the algorithm can smooth out the impact of outliers from bursty traffic. The duration of the divergence is used as a proxy for identifying a queueing trend, at which point the algorithm aggressively reduces the limit.
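A loose sketch of that divergence signal, assuming illustrative smoothing factors (the library's actual Gradient2Limit turns this signal into an adjustment of the limit):

// Track a short-term and a long-term exponential moving average of latency;
// a short/long ratio well above 1.0 suggests a queueing trend. The smoothing
// factors are assumptions for illustration only.
class LatencyDivergence {
    private static final double SHORT_FACTOR = 0.2;   // reacts quickly to latency changes
    private static final double LONG_FACTOR = 0.02;   // smooths out bursts and outliers
    private double shortAvg;
    private double longAvg;

    double observe(double rttNanos) {
        shortAvg = shortAvg == 0 ? rttNanos : shortAvg + SHORT_FACTOR * (rttNanos - shortAvg);
        longAvg  = longAvg  == 0 ? rttNanos : longAvg  + LONG_FACTOR  * (rttNanos - longAvg);
        return shortAvg / longAvg;
    }
}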
In the simplest use case we don't want to differentiate between requests and so enforce a single gauge of the number of inflight requests. Requests are rejected immediately once the gauge value equals the limit.
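A minimal sketch of such a gauge (the library's SimpleLimiter, used in the executor example at the end of this section, provides this behaviour together with an adaptive limit):

import java.util.concurrent.atomic.AtomicInteger;

// Single gauge of inflight requests with immediate rejection at the limit;
// an illustration only, not the library's implementation.
class InflightGauge {
    private final AtomicInteger inflight = new AtomicInteger();
    private final int limit;

    InflightGauge(int limit) { this.limit = limit; }

    /** Returns true if the request may proceed, false if it must be rejected. */
    boolean tryAcquire() {
        while (true) {
            int current = inflight.get();
            if (current >= limit) {
                return false;                              // gauge has reached the limit
            }
            if (inflight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Call when a permitted request completes. */
    void release() {
        inflight.decrementAndGet();
    }
}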
For more complex systems it's desirable to provide certain quality of service guarantees while still making efficient use of resources. Here we guarantee specific types of requests get a certain percentage of the concurrency limit. For example, a system that takes both live and batch traffic may want to give live traffic 100% of the limit during heavy load and is OK with starving batch traffic. Or, a system may want to guarantee that 50% of the limit is given to write traffic so writes are never starved.
A concurrency limiter may be installed either on the server or client. The choice of limiter depends on your use case. For the most part it is recommended to use a dynamic delay based limiter such as the VegasLimit on the server and either a pure loss based (AIMDLimit) or combined loss and delay based limiter on the client.
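For the pure loss-based option, here is a rough sketch of the additive-increase/multiplicative-decrease idea behind AIMDLimit (the parameter values are assumptions; the library's implementation is configurable):

// Grow the limit slowly while requests succeed; cut it sharply when a
// request is dropped or times out.
static int onSample(int limit, boolean didDrop) {
    if (didDrop) {
        return Math.max(1, (int) (limit * 0.9));   // multiplicative decrease on loss
    }
    return limit + 1;                              // additive increase on success
}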
The purpose of the server limiter is to protect the server from either increased client traffic (batch apps or retry storms) or latency spikes from a dependent service. With the limiter installed the server can ensure that latencies remain low by rejecting excess traffic with Status.UNAVAILABLE errors.
In this example a GRPC server is configured with a single adaptive limiter that is shared among batch and live traffic with live traffic guaranteed 90% of throughput and 10% guaranteed to batch. For simplicity we just expect the client to send a "group" header identifying it as 'live' or 'batch'. Ideally this should be done using TLS certificates and a server side lookup of identity to grouping. Any requests not identified as either live or batch may only use excess capacity.
// Create and configure a server builder
ServerBuilder builder = ...;

builder.addService(ServerInterceptors.intercept(service,
    ConcurrencyLimitServerInterceptor.newBuilder(
        new GrpcServerLimiterBuilder()
            .partitionByHeader(GROUP_HEADER)
            .partition("live", 0.9)
            .partition("batch", 0.1)
            .limit(WindowedLimit.newBuilder()
                .build(Gradient2Limit.newBuilder()
                    .build()))
            .build())
        .build()
    ));
There are two main use cases for client side limiters. A client side limiter can protect the client service from its dependent services by failing fast and serving a degraded experience to its client instead of having its latency go up and its resources eventually exhausted. For batch applications that call other services a client side limiter acts as a backpressure mechanism ensuring that the batch application does not put unnecessary load on dependent services.
In this example a GRPC client will use a blocking version of the VegasLimit to block the caller when the limit has been reached.
// Create and configure a channel builder
ManagedChannelBuilder<?> builder = ...;

// Add the concurrency limit interceptor
builder.intercept(
    new ConcurrencyLimitClientInterceptor(
        new GrpcClientLimiterBuilder()
            .blockOnLimit(true)
            .build()
        )
    );
The purpose of the servlet filter limiter is to protect the servlet from either increased client traffic (batch apps or retry storms) or latency spikes from a dependent service. With the limiter installed the server can ensure that latencies remain low by rejecting excess traffic with HTTP 429 Too Many Requests errors.
In this example a servlet is configured with a single adaptive limiter that is shared among batch and live traffic with live traffic guaranteed 90% of throughput and 10% guaranteed to batch. The limiter is given a lookup function that translates the request's Principal to one of the two groups (live vs batch).
Map<String, String> principalToGroup = ...;

Filter filter = new ConcurrencyLimitServletFilter(new ServletLimiterBuilder()
    .partitionByUserPrincipal(principal -> principalToGroup.get(principal.getName()))
    .partition("live", 0.9)
    .partition("batch", 0.1)
    .build());
The BlockingAdaptiveExecutor adapts the size of an internal thread pool to match the concurrency limit based on measured latencies of Runnable commands and will block when the limit has been reached.
public void drainQueue(BlockingQueue<Runnable> tasks) throws InterruptedException {
    Executor executor = new BlockingAdaptiveExecutor(
        SimpleLimiter.newBuilder()
            .build());

    while (true) {
        // execute() blocks the caller once the adaptive concurrency limit is reached
        executor.execute(tasks.take());
    }
}