This snippet implements a simple strategy for model multiplexing in Ray Serve.
- It utilizes Serve's autoscaling capability to adjust the replica count based on traffic.
- It hosts an LRU cache of models per replica (sketched below).
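The snippet itself is not reproduced here; the sketch below shows one way such an `app.py` could look. The `load_model` helper, the cache size, and the autoscaling bounds are illustrative placeholders, not the original code.

```python
# app.py -- a minimal sketch of a multiplexing deployment, not the original snippet.
from collections import OrderedDict

from fastapi import FastAPI
from ray import serve

app = FastAPI()


def load_model(model_id: str):
    # Hypothetical helper: replace with real model loading.
    return lambda: {"model_id": model_id, "prediction": "ok"}


@serve.deployment(
    # Let Serve scale the replica count with traffic; the bounds are placeholders.
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
@serve.ingress(app)
class Multiplexer:
    def __init__(self, cache_size: int = 3):
        self.cache_size = cache_size
        # Per-replica LRU cache: model_id -> loaded model.
        self.models: OrderedDict = OrderedDict()

    def _get_model(self, model_id: str):
        if model_id in self.models:
            self.models.move_to_end(model_id)  # mark as most recently used
            return self.models[model_id]
        if len(self.models) >= self.cache_size:
            self.models.popitem(last=False)  # evict the least recently used model
        self.models[model_id] = load_model(model_id)
        return self.models[model_id]

    @app.post("/predict/{model_id}")
    async def predict(self, model_id: str):
        model = self._get_model(model_id)
        return model()


entrypoint = Multiplexer.bind()
```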
- Install Ray Serve: `pip install ray[serve]`
- Save the file as `app.py`.
- Run the application: `serve run app:entrypoint`
- Query a model: `curl -X POST http://127.0.0.1:8000/predict/model_1`
This setup should be sufficient to handle a mostly uniform request distribution across models. There are several optimizations we can implement and iterate on immediately:
- Load models asynchronously in the background, so that loading a cold model does not add latency to other requests (see the sketch after this list).
- Intelligently route traffic to a replica that already has the requested model loaded.
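A rough sketch of the background-loading idea, assuming a blocking `load_fn` supplied by the caller: cache `asyncio.Task`s rather than finished models, so a cold-model load runs in a worker thread while the replica keeps serving other requests, and concurrent requests for the same model share one load.

```python
import asyncio
from collections import OrderedDict


class AsyncModelCache:
    """Per-replica LRU cache that loads models off the event loop (sketch)."""

    def __init__(self, load_fn, cache_size: int = 3):
        self.load_fn = load_fn                    # blocking loader (assumed)
        self.cache_size = cache_size
        self.models: OrderedDict = OrderedDict()  # model_id -> asyncio.Task

    async def get(self, model_id: str):
        if model_id not in self.models:
            if len(self.models) >= self.cache_size:
                # Simplified eviction: drops the entry even if it is still loading.
                self.models.popitem(last=False)
            # Run the blocking load in a worker thread so other requests on
            # this replica are not stalled behind it.
            self.models[model_id] = asyncio.create_task(
                asyncio.to_thread(self.load_fn, model_id)
            )
        self.models.move_to_end(model_id)  # mark as most recently used
        return await self.models[model_id]
```

Inside the deployment, the request handler would then call `model = await self.cache.get(model_id)` instead of loading inline.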
Nice!
`cache_stats` could be kept as internal stats (no need to return it). I am thinking of doing the smart routing in the router with a simple hash function, and then we would not need to return `replica_id` either.
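One rough shape of that routing idea (the deployment names and the two-way fan-out are assumptions, and it relies on the `DeploymentHandle` API where awaiting the response returns the result): a thin router hashes the model ID to pick a backend, so requests for the same model keep landing on the same cache.

```python
import hashlib

from fastapi import FastAPI
from ray import serve

app = FastAPI()


@serve.deployment
class ModelServer:
    """Stand-in for the multiplexing deployment that holds the LRU cache."""

    async def predict(self, model_id: str):
        return {"model_id": model_id, "served_by": "ModelServer"}


@serve.deployment
@serve.ingress(app)
class Router:
    def __init__(self, *backends):
        self.backends = backends  # DeploymentHandles to the ModelServer copies

    @app.post("/predict/{model_id}")
    async def predict(self, model_id: str):
        # The same model_id always hashes to the same backend, so its
        # weights stay warm in that backend's cache.
        idx = int(hashlib.md5(model_id.encode()).hexdigest(), 16) % len(self.backends)
        return await self.backends[idx].predict.remote(model_id)


# Two backends for illustration only.
entrypoint = Router.bind(
    ModelServer.options(name="model_server_0").bind(),
    ModelServer.options(name="model_server_1").bind(),
)
```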