Cloned and made sure the nightly image was up to date:

docker run -it --ipc=host --network=host --group-add render \
    --privileged --security-opt seccomp=unconfined \
    --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
    --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
    -e HF_TOKEN=$HF_TOKEN -e HF_HOME=/data/model_cache \
    -e MODEL=$MODEL \

docker run --rm -it --ipc=host --network=host --group-add render \
    --privileged --security-opt seccomp=unconfined \
    --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
    --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
    -e HF_TOKEN=$HF_TOKEN -e HF_HOME=/data/model_cache \
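
The device and group flags are what expose the AMD GPU to the container: `/dev/kfd` is the ROCm compute (KFD) node, `/dev/dri` holds the render nodes, and `--group-add render` gives the container user permission to open them. `HF_TOKEN` and `HF_HOME=/data/model_cache` point the Hugging Face client at an access token and a shared model cache so weights are not re-downloaded on every run.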
$ ./install.sh
+ set -e
+ set -o pipefail
++ command -v git
+ '[' -z /usr/bin/git ']'
++ command -v kubectl
+ '[' -z '/usr/local/bin/kubectl]'
./install.sh: line 10: [: missing `]'
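
The `missing ]` error is in the kubectl check on line 10: the trace shows the closing bracket glued onto the path (`/usr/local/bin/kubectl]`), so the script is missing a space before `]` and `[` never receives its closing argument. Adding the space, e.g. `[ -z "$(command -v kubectl)" ]`, fixes it.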
$ python ./benchmark-e2e.py --port 8000 --model "meta-llama/Llama-3.2-1B" --cuda-device 0
Using port: 8000
Removing /home/ubuntu/vllm/benchmark-e2e/benchmark-compare
Removing /home/ubuntu/vllm/benchmark-e2e/venv-vllm
Removing /home/ubuntu/vllm/benchmark-e2e/venv-vllm-src
Removing /home/ubuntu/vllm/benchmark-e2e/venv-sgl
▶ git clone https://github.com/neuralmagic/benchmark-compare.git /home/ubuntu/vllm/benchmark-e2e/benchmark-compare
Cloning into '/home/ubuntu/vllm/benchmark-e2e/benchmark-compare'...
remote: Enumerating objects: 78, done.
$ go build
$ ./benchmark-go --port 8000 --model meta-llama/Llama-3.2-1B --cuda-device 0
[main] 2025/04/23 04:20:18 Using port: 8000
[main] 2025/04/23 04:20:18 Removing /home/ubuntu/vllm/benchmark-go/benchmark-compare
[main] 2025/04/23 04:20:18 Removing /home/ubuntu/vllm/benchmark-go/venv-vllm
[main] 2025/04/23 04:20:19 Removing /home/ubuntu/vllm/benchmark-go/venv-vllm-src
[main] 2025/04/23 04:20:19 Removing /home/ubuntu/vllm/benchmark-go/venv-sgl
[main] 2025/04/23 04:20:19 ▶ git clone https://github.com/neuralmagic/benchmark-compare.git /home/ubuntu/vllm/benchmark-go/benchmark-compare
Cloning into '/home/ubuntu/vllm/benchmark-go/benchmark-compare'...
podman run --rm -it \
    --network host \
    -e MODEL=meta-llama/Llama-3.2-1B \
    -e FRAMEWORK=vllm \
    -e HF_TOKEN="${HF_TOKEN}" \
    -e PORT=8000 \
    -e HOST=172.31.37.101 \
    -v "$(pwd)":/host:Z \
    -w /opt/benchmark \
    quay.io/bsalisbu/vllm-benchmark:latest

===== vllm - RUNNING meta-llama/Llama-3.2-1B FOR 120 PROMPTS WITH 1 QPS =====

INFO 04-23 01:38:59 [__init__.py:243] No platform detected, vLLM is running on UnspecifiedPlatform
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='meta-llama/Llama-3.2-1B', tokenizer=None, use_beam_search=False, num_prompts=120, logprobs=None, request_rate=1.0, burstiness=1.0, seed=1, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=True, save_detailed=False, metadata=['framework=vllm'], result_dir=None, result_filename='results.json', ignore_eos=True, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonn
{"date": "20250411-002323", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-llama/Llama-3.2-1B", "num_prompts": 120, "framework": "vllm", "request_rate": 1.0, "burstiness": 1.0, "max_concurrency": null, "duration": 99.19297069497406, "completed": 120, "total_input_tokens": 120000, "total_output_tokens": 12000, "request_throughput": 1.2097631430861078, "request_goodput:": null, "output_throughput": 120.97631430861077, "total_token_throughput": 1330.7394573947186, "mean_ttft_ms": 56.25359537589247, "median_ttft_ms": 55.28098650393076, "std_ttft_ms": 6.545660891106274, "p99_ttft_ms": 78.70767521642848, "mean_tpot_ms": 7.615017463035274, "median_tpot_ms": 7.524641732229014, "std_tpot_ms": 0.5137661324558762, "p99_tpot_ms": 8.92187871250578, "mean_itl_ms": 7.615019605385871, "median_itl_ms": 7.299988501472399, "std_itl_ms": 3.706068885790247, "p99_itl_ms": 8.394360903184861}
{"date": "20250411-002536", "backend": "vllm", "model_id": "meta-llama/Llama-3.2-1B", "tokenizer_id": "meta-l

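Each run appends one JSON object per line to results.json; the key latency metrics are ttft (time to first token), tpot (time per output token), and itl (inter-token latency). A comparison table like the one below can be built from that file. This is a minimal sketch, assuming pandas and tabulate are installed and the file is in JSON-lines format:

```
#!/usr/bin/env python3
"""Flatten benchmark results.json (JSON lines) into a markdown table."""
import pandas as pd  # requires: pip install pandas tabulate

# Each line in results.json is one benchmark run serialized as a JSON object.
df = pd.read_json("results.json", lines=True)

# Keep a readable subset of columns; the full set matches the JSON keys above.
cols = [
    "date", "framework", "model_id", "num_prompts", "request_rate",
    "request_throughput", "output_throughput", "total_token_throughput",
    "mean_ttft_ms", "p99_ttft_ms", "mean_tpot_ms", "p99_tpot_ms",
]
print(df[cols].to_markdown(index=False))
```
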
| date | backend | model_id | tokenizer_id | num_prompts | framework | request_rate | burstiness | max_concurrency | duration | completed | total_input_tokens | total_output_tokens | request_throughput | request_goodput: | output_throughput | total_token_throughput | mean_ttft_ms | median_ttft_ms | std_ttft_ms | p99_ttft_ms | mean_tpot_ms | median_tpot_ms | std_tpot_ms | p99_tpot_ms | mean_itl_ms | median_itl_ms | std_itl_ms | p99_itl_ms |
|:---|:---|:---|:---|---:|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 20250411-002323 | vllm | meta-llama/Llama-3.2-1B | meta-llama/Llama-3.2-1B | 120 | vllm | 1.0 | 1.0 |  | 99.19297069497406 | 120 | 120000 | 12000 | 1.2097631430861078 |  | 120.97631430861077 | 1330.7394573947186 | 56.25359537589247 | 55.28098650393076 | 6.545660891106274 | 78.70767521642848 | 7.615017463035274 | 7.524641732229014 | 0.5137661324558762 | 8.92187871250578 | 7.615019605385871 | 7.299988501472399 | 3.706068885790247 | 8.394360903184861 |


Ilab UI API Server


1. GET /models

Purpose:

#!/usr/bin/env python3
"""
Example script that:
1. Converts a document to a Docling Document.
2. Chunks it using a HybridChunker.
3. Embeds each chunk using a SentenceTransformer.
4. Stores them in a LanceDB index.
5. Searches for a user-provided query and returns the best matching chunk or all matching chunks based on a flag.
"""