A translation of the README of https://github.com/Netflix/concurrency-limits
A Java library that implements and integrates concepts from TCP congestion control to auto-detect concurrency limits for services, in order to achieve optimal throughput with optimal latency.
When thinking about service availability, operators traditionally think in terms of RPS (requests per second). Load tests are usually run to determine the RPS at which the service falls over (*). An RPS limit is then set somewhere below this breaking point (say 75% of it) and enforced with a token bucket. In large, auto-scaling distributed systems, however, this value quickly becomes stale, and the service fails anyway because it cannot shed the excess load and becomes unresponsive.
(*) The point at which latency starts to degrade because the number of concurrent requests is too high.
Instead of thinking in terms of RPS, we should think in terms of concurrent requests, applying queuing theory to determine the number of concurrent requests a service can handle before a queue builds up, latency rises, and the service eventually exhausts a hard limit.
This relationship is nicely captured by Little's Law: Limit = average RPS * average latency.
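For example, a service that averages 1,000 RPS with an average latency of 0.05 seconds has roughly 1,000 * 0.05 = 50 requests in flight at any moment, so its natural concurrency limit is about 50.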
Enforcing a concurrency limit is very simple, but determining that limit is hard, because it requires the operator to fully understand the hardware the service runs on and to retune the limit as the service scales.
Instead, we prefer to measure or estimate the concurrency limit at each point in the network.
As the system scales and hits a limit, each node adjusts and enforces its own per-node limit. To estimate the limit, we borrow from common TCP congestion control algorithms by equating a system's concurrency limit with a TCP congestion window.
Before applying the algorithm we need to set some ground rules.
- We accept that every system has an inherent concurrency limit that is determined by hard resources, such as the number of CPU cores.
- We accept that this limit can change as a system auto-scales.
- For large and complex distributed systems it's impossible to know all the hard resources.
- We can use latency measurements to determine when queuing happens.
- We can use timeouts and rejected requests to aggressively back off.
VegasLimit is a delay-based algorithm where the bottleneck queue is estimated as

queue = L * (1 - minRTT / sampleRTT)

At the end of each sampling window the limit is increased by 1 if the estimated queue is less than alpha (typically a value between 2 and 3) or decreased by 1 if the queue is greater than beta (typically a value between 4 and 6 requests).
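As a rough sketch of this per-window update rule (not the library's actual VegasLimit implementation, which is more sophisticated), it can be pictured like this:

// Illustrative sketch only: estimate the queue from the latency ratio and
// nudge the limit up or down once per sampling window.
static int updateVegasLimit(int limit, long minRttNanos, long sampleRttNanos, int alpha, int beta) {
    double estimatedQueue = limit * (1.0 - (double) minRttNanos / sampleRttNanos);
    if (estimatedQueue < alpha) {
        return limit + 1;   // little queueing observed: probe for more concurrency
    } else if (estimatedQueue > beta) {
        return limit - 1;   // queue is building up: back off
    }
    return limit;           // within the target band: keep the current limit
}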
The Gradient2Limit algorithm attempts to address bias and drift when using minimum latency measurements. To do this the algorithm tracks the divergence between two exponential averages over a long and a short time window. By using averages the algorithm can smooth out the impact of outliers from bursty traffic. The duration of the divergence is used as a proxy for identifying a queueing trend, at which point the algorithm aggressively reduces the limit.
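A loose sketch of that divergence signal, assuming illustrative smoothing factors (the library's actual Gradient2Limit turns this signal into an adjustment of the limit):

// Track a short-term and a long-term exponential moving average of latency;
// a short/long ratio well above 1.0 suggests a queueing trend. The smoothing
// factors are assumptions for illustration only.
class LatencyDivergence {
    private static final double SHORT_FACTOR = 0.2;   // reacts quickly to latency changes
    private static final double LONG_FACTOR = 0.02;   // smooths out bursts and outliers
    private double shortAvg;
    private double longAvg;

    double observe(double rttNanos) {
        shortAvg = shortAvg == 0 ? rttNanos : shortAvg + SHORT_FACTOR * (rttNanos - shortAvg);
        longAvg  = longAvg  == 0 ? rttNanos : longAvg  + LONG_FACTOR  * (rttNanos - longAvg);
        return shortAvg / longAvg;
    }
}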
In the simplest use case we don't want to differentiate between requests and so enforce a single gauge of the number of inflight requests. Requests are rejected immediately once the gauge value equals the limit.
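A minimal sketch of such a gauge (the library's SimpleLimiter, used in the executor example at the end of this section, provides this behaviour together with an adaptive limit):

import java.util.concurrent.atomic.AtomicInteger;

// Single gauge of inflight requests with immediate rejection at the limit;
// an illustration only, not the library's implementation.
class InflightGauge {
    private final AtomicInteger inflight = new AtomicInteger();
    private final int limit;

    InflightGauge(int limit) { this.limit = limit; }

    /** Returns true if the request may proceed, false if it must be rejected. */
    boolean tryAcquire() {
        while (true) {
            int current = inflight.get();
            if (current >= limit) {
                return false;                              // gauge has reached the limit
            }
            if (inflight.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Call when a permitted request completes. */
    void release() {
        inflight.decrementAndGet();
    }
}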
For more complex systems it's desirable to provide certain quality of service guarantees while still making efficient use of resources. Here we guarantee specific types of requests get a certain percentage of the concurrency limit. For example, a system that takes both live and batch traffic may want to give live traffic 100% of the limit during heavy load and is OK with starving batch traffic. Or, a system may want to guarantee that 50% of the limit is given to write traffic so writes are never starved.
A concurrency limiter may be installed either on the server or client. The choice of limiter depends on your use case. For the most part it is recommended to use a dynamic delay based limiter such as the VegasLimit on the server and either a pure loss based (AIMDLimit) or combined loss and delay based limiter on the client.
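For the pure loss-based option, here is a rough sketch of the additive-increase/multiplicative-decrease idea behind AIMDLimit (the parameter values are assumptions; the library's implementation is configurable):

// Grow the limit slowly while requests succeed; cut it sharply when a
// request is dropped or times out.
static int onSample(int limit, boolean didDrop) {
    if (didDrop) {
        return Math.max(1, (int) (limit * 0.9));   // multiplicative decrease on loss
    }
    return limit + 1;                              // additive increase on success
}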
The purpose of the server limiter is to protect the server from either increased client traffic (batch apps or retry storms) or latency spikes from a dependent service. With the limiter installed the server can ensure that latencies remain low by rejecting excess traffic with Status.UNAVAILABLE errors.
In this example a GRPC server is configured with a single adaptive limiter that is shared among batch and live traffic with live traffic guaranteed 90% of throughput and 10% guaranteed to batch. For simplicity we just expect the client to send a "group" header identifying it as 'live' or 'batch'. Ideally this should be done using TLS certificates and a server side lookup of identity to grouping. Any requests not identified as either live or batch may only use excess capacity.
// Create and configure a server builder
ServerBuilder builder = ...;

builder.addService(ServerInterceptors.intercept(service,
    ConcurrencyLimitServerInterceptor.newBuilder(
        new GrpcServerLimiterBuilder()
            .partitionByHeader(GROUP_HEADER)
            .partition("live", 0.9)
            .partition("batch", 0.1)
            .limit(WindowedLimit.newBuilder()
                .build(Gradient2Limit.newBuilder()
                    .build()))
            .build())
        .build()
    ));
There are two main use cases for client side limiters. A client side limiter can protect the client service from its dependent services by failing fast and serving a degraded experience to its client instead of having its latency go up and its resources eventually exhausted. For batch applications that call other services a client side limiter acts as a backpressure mechanism ensuring that the batch application does not put unnecessary load on dependent services.
In this example a GRPC client will use a blocking version of the VegasLimit to block the caller when the limit has been reached.
// Create and configure a channel builder
ManagedChannelBuilder<?> builder = ...;

// Add the concurrency limit interceptor
builder.intercept(
    new ConcurrencyLimitClientInterceptor(
        new GrpcClientLimiterBuilder()
            .blockOnLimit(true)
            .build()
        )
    );
The purpose of the servlet filter limiter is to protect the servlet from either increased client traffic (batch apps or retry storms) or latency spikes from a dependent service. With the limiter installed the server can ensure that latencies remain low by rejecting excess traffic with HTTP 429 Too Many Requests errors.
In this example a servlet is configured with a single adaptive limiter that is shared among batch and live traffic with live traffic guaranteed 90% of throughput and 10% guaranteed to batch. The limiter is given a lookup function that translates the request's Principal to one of the two groups (live vs batch).
Map<String, String> principalToGroup = ...;

Filter filter = new ConcurrencyLimitServletFilter(new ServletLimiterBuilder()
    .partitionByUserPrincipal(principal -> principalToGroup.get(principal.getName()))
    .partition("live", 0.9)
    .partition("batch", 0.1)
    .build());
The BlockingAdaptiveExecutor adapts the size of an internal thread pool to match the concurrency limit based on measured latencies of Runnable commands and will block when the limit has been reached.
public void drainQueue(BlockingQueue<Runnable> tasks) throws InterruptedException {
    Executor executor = new BlockingAdaptiveExecutor(
        SimpleLimiter.newBuilder()
            .build());

    while (true) {
        // execute() blocks the caller once the adaptive concurrency limit is reached
        executor.execute(tasks.take());
    }
}