@simon-mo
Created September 20, 2022 20:56

Simple Model Multiplexing with Ray Serve

This snippet implements a simple strategy for model multiplexing in Ray Serve.

  • It utilizes Serve's autoscaling capability to adjust the replica count based on traffic.
  • It hosts an LRU cache of loaded models per replica.

Run it

  • pip install ray[serve]
  • Save the snippet below as app.py
  • serve run app:entrypoint
  • curl -X POST http://127.0.0.1:8000/predict/model_1
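
A request to the endpoint returns the model's prediction along with the per-replica cache statistics and the serving replica's tag, as assembled in the snippet below. The response shown here is illustrative; the actual cache counts and replica tag will differ:

curl -X POST http://127.0.0.1:8000/predict/model_1
# Illustrative response shape (values vary per replica and request):
# {
#   "result": "model_1",
#   "cache_stats": {"hits": 4, "misses": 1, "maxsize": 32, "currsize": 1},
#   "replica_id": "<tag of the replica that served the request>"
# }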

Next step

This setup should be sufficient to handle a mostly uniform request distribution. There are several optimizations we can immediately implement and iterate on:

  • Load the model asynchronously in the background. This will lower other requests' latency (a minimal sketch appears after the snippet below).
  • Intelligently route traffic to the replica that already has the model (see the hash-assignment sketch in the discussion below).
from functools import lru_cache

from fastapi import FastAPI
from ray import serve
from ray.serve.context import get_internal_replica_context

app = FastAPI()


class Model:
    """Stand-in for a real model; predict() just echoes the model id."""

    def __init__(self, model_id):
        self.model_id = model_id

    def predict(self):
        return self.model_id


@lru_cache(maxsize=32)
def load_model(model_id):
    # One LRU cache per replica process: the least recently used models are
    # evicted once more than 32 distinct model ids have been loaded.
    return Model(model_id)


@serve.deployment
@serve.ingress(app)
class MultiplexedModels:
    @app.post("/predict/{model_id}")
    def predict(self, model_id: str):
        model = load_model(model_id)
        return {
            "result": model.predict(),
            "cache_stats": load_model.cache_info()._asdict(),
            "replica_id": get_internal_replica_context().replica_tag,
        }


entrypoint = MultiplexedModels.options(
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
    route_prefix="/",
).bind()
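
A minimal sketch of the first optimization: loading models off the event loop in a background thread so a slow load does not block other requests on the same replica. The load_model_async helper and the thread-pool approach are assumptions for illustration, not part of the gist:

import asyncio
from functools import lru_cache


@lru_cache(maxsize=32)
def _load_model_blocking(model_id):
    # Assumed to be the slow part (e.g., reading weights from disk).
    # Model is the class from app.py above.
    return Model(model_id)


async def load_model_async(model_id):
    # Hypothetical helper: run the blocking load in the default thread pool so
    # the replica's event loop can keep serving other requests meanwhile.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, _load_model_blocking, model_id)

The request handler would then become async def predict(...) and call model = await load_model_async(model_id).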
@sihanwang41

Nice! cache_stats might already exist as stats (no need to return it).

I am thinking of doing the smart routing in the router with a simple hash function; then we don't need to return replica_id.
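
A conceptual sketch of that idea: a stable hash that always maps the same model_id to the same replica slot. The pick_replica helper is hypothetical and is not wired into Serve's router here:

import hashlib


def pick_replica(model_id: str, num_replicas: int) -> int:
    # Stable hash so a given model_id consistently lands on the same
    # replica slot, keeping its cache warm there.
    digest = hashlib.md5(model_id.encode()).hexdigest()
    return int(digest, 16) % num_replicas


# e.g. pick_replica("model_1", 4) always returns the same index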
