- Launch an Ubuntu 24.04 `m7i.4xlarge` instance (16 vCPU, 64 GB memory). Change storage to 500 GB.
- Install Docker:
```bash
# Add Docker's official GPG key:
sudo apt-get -y update
sudo apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get -y update

sudo apt-get -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
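Optionally verify the installation with Docker's hello-world image (a quick sanity check, not part of the original steps):

```bash
# Prints a greeting if the Docker daemon is installed and running
sudo docker run --rm hello-world
```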
- Using the AWS Console, launch an Amazon Linux 2023 `m7i.4xlarge` instance (16 vCPU, 64 GB memory). Change storage to 500 GB.
- Install Docker:
```bash
sudo yum update -y
sudo yum install -y docker
sudo service docker start
# Optional: allow running Docker without sudo (log out and back in to take effect)
#sudo usermod -a -G docker ec2-user
```
- Install Docker Compose:
```bash
sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
- Pull OPEA Docker images:
```bash
sudo docker pull opea/chatqna:latest
sudo docker pull opea/chatqna-conversation-ui:latest
```
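Optionally confirm both images are present before continuing:

```bash
# Lists the OPEA images pulled above
sudo docker images | grep opea
```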
- Replace the Hugging Face API token and the private IP address of the host below, then save the contents in a file named `.env`:

```bash
host_ip=172.31.37.13 # private IP address of the host
no_proxy=${host_ip}
HUGGINGFACEHUB_API_TOKEN="Your_Huggingface_API_Token"
EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
RERANK_MODEL_ID="BAAI/bge-reranker-base"
LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
TEI_EMBEDDING_ENDPOINT="http://${host_ip}:6006"
TEI_RERANKING_ENDPOINT="http://${host_ip}:8808"
TGI_LLM_ENDPOINT="http://${host_ip}:9009"
REDIS_URL="redis://${host_ip}:6379"
INDEX_NAME="rag-redis"
REDIS_HOST=${host_ip}
MEGA_SERVICE_HOST_IP=${host_ip}
EMBEDDING_SERVICE_HOST_IP=${host_ip}
RETRIEVER_SERVICE_HOST_IP=${host_ip}
RERANK_SERVICE_HOST_IP=${host_ip}
LLM_SERVICE_HOST_IP=${host_ip}
BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/chatqna"
DATAPREP_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/dataprep"
DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_file"
DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_file"
```
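If you prefer not to copy the private IP by hand, you can derive `host_ip` on the instance itself (a small convenience sketch; it assumes the first address reported by `hostname -I` is the instance's private IPv4 address):

```bash
# Derive the host's private IP for use in the .env file's host_ip line
host_ip=$(hostname -I | awk '{print $1}')
echo "host_ip=${host_ip}"
```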
- Download the Docker Compose file:

```bash
curl -O https://raw.githubusercontent.com/opea-project/GenAIExamples/main/ChatQnA/docker_compose/intel/cpu/xeon/compose.yaml
```
- Start the application:
```bash
sudo docker compose -f compose.yaml up -d
```
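While the stack starts, you can follow the combined container logs (optional; press Ctrl-C to stop following):

```bash
sudo docker compose -f compose.yaml logs -f
```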
- Verify the list of containers:
```
ubuntu@ip-172-31-79-111:~$ sudo docker container ls
CONTAINER ID   IMAGE                                                                  COMMAND                  CREATED         STATUS         PORTS                                                                                  NAMES
29f3a466d175   opea/chatqna-ui:latest                                                 "docker-entrypoint.s…"   4 minutes ago   Up 4 minutes   0.0.0.0:5173->5173/tcp, :::5173->5173/tcp                                              chatqna-xeon-ui-server
1020fa2a75c2   opea/chatqna:latest                                                    "python chatqna.py"      4 minutes ago   Up 4 minutes   0.0.0.0:8888->8888/tcp, :::8888->8888/tcp                                              chatqna-xeon-backend-server
02112b28ee54   opea/dataprep-redis:latest                                             "python prepare_doc_…"   4 minutes ago   Up 4 minutes   0.0.0.0:6007->6007/tcp, :::6007->6007/tcp                                              dataprep-redis-server
94aaec2991d6   opea/retriever-redis:latest                                            "python retriever_re…"   4 minutes ago   Up 4 minutes   0.0.0.0:7000->7000/tcp, :::7000->7000/tcp                                              retriever-redis-server
9fb6744ceb24   opea/llm-tgi:latest                                                    "bash entrypoint.sh"     4 minutes ago   Up 4 minutes   0.0.0.0:9000->9000/tcp, :::9000->9000/tcp                                              llm-tgi-server
27576d976a3d   opea/embedding-tei:latest                                              "python embedding_te…"   4 minutes ago   Up 4 minutes   0.0.0.0:6000->6000/tcp, :::6000->6000/tcp                                              embedding-tei-server
3e04371fd54b   opea/reranking-tei:latest                                              "python reranking_te…"   4 minutes ago   Up 4 minutes   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp                                              reranking-tei-xeon-server
62929403a9ed   ghcr.io/huggingface/text-generation-inference:sha-e4201f4-intel-cpu   "text-generation-lau…"   5 minutes ago   Up 4 minutes   0.0.0.0:9009->80/tcp, [::]:9009->80/tcp                                                tgi-service
50208c6bc36c   redis/redis-stack:7.2.0-v9                                             "/entrypoint.sh"         5 minutes ago   Up 4 minutes   0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 0.0.0.0:8001->8001/tcp, :::8001->8001/tcp   redis-vector-db
2a4158c2dbc8   ghcr.io/huggingface/text-embeddings-inference:cpu-1.5                  "text-embeddings-rou…"   5 minutes ago   Up 4 minutes   0.0.0.0:8808->80/tcp, [::]:8808->80/tcp                                                tei-reranking-server
47a59e0d52de   ghcr.io/huggingface/text-embeddings-inference:cpu-1.5                  "text-embeddings-rou…"   5 minutes ago   Up 4 minutes   0.0.0.0:6006->80/tcp, [::]:6006->80/tcp                                                tei-embedding-server
```
Export the `host_ip` environment variable:

```bash
export host_ip=172.31.37.13
```
Test the TEI embedding service:

```bash
curl ${host_ip}:6006/embed \
  -X POST \
  -d '{"inputs":"What is Deep Learning?"}' \
  -H 'Content-Type: application/json'
```
Answer:

```
[[0.00037115702,-0.06356819,0.0024758505,-0.012360337,0.050739925,0.023380278,0.022216318,0.0008076447,-0.0003412891,
. . .
-0.0067949123,0.022558564,-0.04570635,-0.033072025,0.022725677,0.016026087,-0.02125421,-0.02984927,-0.0049473033]]
```
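The response is a single embedding vector with 768 dimensions, the output size of BAAI/bge-base-en-v1.5. If `jq` is installed (an assumption; it is not set up by the steps above), you can check the dimensionality directly:

```bash
# Count the elements of the first (and only) returned embedding
curl -s ${host_ip}:6006/embed \
  -X POST \
  -d '{"inputs":"What is Deep Learning?"}' \
  -H 'Content-Type: application/json' | jq '.[0] | length'
# Expected output: 768
```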
Test the embedding microservice:

```bash
curl http://${host_ip}:6000/v1/embeddings \
  -X POST \
  -d '{"text":"hello"}' \
  -H 'Content-Type: application/json'
```
Answer:

```
{"id":"b73c50b8a8b535c3af708ebc16b0d9cd","text":"hello","embedding":[0.0007791813,0.042613804,0.020304274,-0.0070378557,0.00020632005,0.020170836,-0.00021343566,0.04560513,-0.04856186,-0.0681003
. . .
027401684,-0.052007433,0.016100302,0.059366036,-0.0044034636],"search_type":"similarity","k":4,"distance_threshold":null,"fetch_k":20,"lambda_mult":0.5,"score_threshold":0.2}
```
Test the retriever microservice (the Python one-liner generates a random 768-dimensional vector to stand in for a real query embedding):

```bash
export your_embedding=$(python3 -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")
curl http://${host_ip}:7000/v1/retrieval \
  -X POST \
  -d "{\"text\":\"test\",\"embedding\":${your_embedding}}" \
  -H 'Content-Type: application/json'
```
Answer (`retrieved_docs` is empty because nothing has been ingested into the knowledge base yet):

```
{"id":"4e0eb0f1ac507c4fbd8f4c843f705f78","retrieved_docs":[],"initial_query":"test","top_n":1}
```
Test the TEI reranking service:

```bash
curl http://${host_ip}:8808/rerank \
  -X POST \
  -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
  -H 'Content-Type: application/json'
```
Answer:

```
[{"index":1,"score":0.94238955},{"index":0,"score":0.120219156}]
```
Test the reranking microservice:

```bash
curl http://${host_ip}:8000/v1/reranking \
  -X POST \
  -d '{"initial_query":"What is Deep Learning?", "retrieved_docs": [{"text":"Deep Learning is not..."}, {"text":"Deep learning is..."}]}' \
  -H 'Content-Type: application/json'
```
Answer:

```
{"id":"41d35cb9153ceb018d62fd1271194aa5","model":null,"query":"What is Deep Learning?","max_new_tokens":1024,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true,"chat_template":null,"documents":["Deep learning is..."]}
```
- Check the logs:

```bash
sudo docker logs tgi-service
```
It takes about 5 minutes for this service to be ready. Wait until you see log output like this:
```
. . .
2024-09-03T17:28:42.909843Z  INFO text_generation_launcher: Downloaded /data/models--Intel--neural-chat-7b-v3-3/snapshots/bdd31cf498d13782cc7497cba5896996ce429f91/pytorch_model-00002-of-00002.bin in 0:00:25.
2024-09-03T17:28:42.909864Z  INFO text_generation_launcher: Download: [2/2] -- ETA: 0
2024-09-03T17:28:42.909880Z  WARN text_generation_launcher: 🚨🚨BREAKING CHANGE in 2.0🚨🚨: Safetensors conversion is disabled without `--trust-remote-code` because Pickle files are unsafe and can essentially contain remote code execution! Please check for more information here: https://huggingface.co/docs/text-generation-inference/basic_tutorials/safety
2024-09-03T17:28:42.910155Z  WARN text_generation_launcher: No safetensors weights found for model Intel/neural-chat-7b-v3-3 at revision None. Converting PyTorch weights to safetensors.
2024-09-03T17:30:26.694416Z  INFO text_generation_launcher: Convert: [1/2] -- Took: 0:01:43.759912
2024-09-03T17:30:58.726409Z  INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:32.031806
2024-09-03T17:30:59.727506Z  INFO download: text_generation_launcher: Successfully downloaded weights for Intel/neural-chat-7b-v3-3
2024-09-03T17:30:59.727942Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-03T17:31:03.179128Z  WARN text_generation_launcher: FBGEMM fp8 kernels are not installed.
2024-09-03T17:31:03.196988Z  INFO text_generation_launcher: Using Attention = False
2024-09-03T17:31:03.197034Z  INFO text_generation_launcher: Using Attention = paged
2024-09-03T17:31:03.251121Z  WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-09-03T17:31:03.410013Z  INFO text_generation_launcher: affinity={0, 1, 2, 3, 4, 5, 6, 7}, membind = {0}
2024-09-03T17:31:06.539109Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-09-03T17:31:06.584728Z  INFO shard-manager: text_generation_launcher: Shard ready in 6.806910922s rank=0
2024-09-03T17:31:06.633462Z  INFO text_generation_launcher: Starting Webserver
2024-09-03T17:31:06.820764Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
2024-09-03T17:31:22.384496Z  INFO text_generation_launcher: Cuda Graphs are disabled (CUDA_GRAPHS=None).
2024-09-03T17:31:22.384696Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:102: Setting max batch total tokens to 292528
2024-09-03T17:31:22.384724Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:126: Using backend V3
2024-09-03T17:31:22.384750Z  INFO text_generation_router::server: router/src/server.rs:1651: Using the Hugging Face API
2024-09-03T17:31:22.384775Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-09-03T17:31:22.832789Z  INFO text_generation_router::server: router/src/server.rs:2349: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model Intel/neural-chat-7b-v3-3
2024-09-03T17:31:22.862858Z  INFO text_generation_router::server: router/src/server.rs:1747: Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205
2024-09-03T17:31:22.862918Z  INFO text_generation_router::server: router/src/server.rs:1781: Using config Some(Mistral)
2024-09-03T17:31:22.862944Z  WARN text_generation_router::server: router/src/server.rs:1928: Invalid hostname, defaulting to 0.0.0.0
2024-09-03T17:31:22.868816Z  INFO text_generation_router::server: router/src/server.rs:2311: Connected
```
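Instead of watching the logs, you can poll the service until it answers; a minimal sketch assuming TGI's standard `/health` route:

```bash
# Loop until tgi-service responds with a 2xx status
until curl -sf http://${host_ip}:9009/health > /dev/null; do
  echo "waiting for tgi-service..."
  sleep 10
done
echo "tgi-service is ready"
```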
- Check the TGI service:

```bash
# TGI service
curl http://${host_ip}:9009/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'
```
with the response:

```
{"generated_text":"\n\nDeep Learning is a subset of Machine Learning based on Artificial Neural Network"}
```
- Check the OpenAI-compatible completions endpoint (served here on port 9009 by the TGI container, as the `system_fingerprint` in the response shows; the same request works against a vLLM backend if you swap one in):

```bash
curl http://${host_ip}:9009/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32, "temperature": 0}'
```
with the response:

```
{"object":"text_completion","id":"","created":1725387779,"model":"Intel/neural-chat-7b-v3-3","system_fingerprint":"2.2.1-dev0-sha-e4201f4-intel-cpu","choices":[{"index":0,"text":"\n\nDeep Learning is a subset of Machine Learning that is concerned with algorithms inspired by the structure and function of the brain. It is a part of Artificial","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":6,"completion_tokens":32,"total_tokens":38}}
```
Test the LLM microservice:

```bash
curl http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
Answer:

```
data: b'\n'
data: b'\n'
data: b'Deep'
data: b' learning'
data: b' is'
data: b' a'
data: b' subset'
data: b' of'
data: b' machine'
data: b' learning'
data: b' that'
data: b' uses'
data: b' algorithms'
data: b' to'
data: b' learn'
data: b' from'
data: b' data'
data: [DONE]
```
Test the ChatQnA mega-service:

```bash
curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{
     "messages": "What is the revenue of Nike in 2023?"
     }'
```
Answer:

```
data: b'\n'
data: b'\n'
data: b'N'
data: b'ike'
data: b"'"
data: b's'
data: b' revenue'
data: b' for'
. . .
data: b' popularity'
data: b' among'
data: b' consumers'
data: b'.'
data: b'</s>'
data: [DONE]
```
- Ask a question. Before the knowledge base is updated, the model guesses at an unrelated expansion of "OPEA":

```
[ec2-user@ip-172-31-77-194 ~]$ curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{ "messages": "What is OPEA?" }'
data: b'\n'
data: b'\n'
data: b'The'
data: b' Oklahoma'
data: b' Public'
data: b' Em'
data: b'ploy'
data: b'ees'
data: b' Association'
```
- Update the knowledge base:

```
[ec2-user@ip-172-31-77-194 ~]$ curl -X POST "http://${host_ip}:6007/v1/dataprep" \
  -H "Content-Type: multipart/form-data" \
  -F 'link_list=["https://opea.dev"]'
{"status":200,"message":"Data preparation succeeded"}
```
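To inspect what the knowledge base currently holds, you can call the `get_file` endpoint declared in `.env` (a sketch; the exact request and response shape may differ between OPEA releases):

```bash
# Hedged: assumes the dataprep get_file route accepts a bare POST
curl -X POST "http://${host_ip}:6007/v1/dataprep/get_file" \
  -H "Content-Type: application/json"
```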
- Ask the question again. With https://opea.dev ingested, the answer is now grounded in the right source:

```
curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{ "messages": "What is OPEA?" }'
data: b'\n'
data: b'O'
data: b'PE'
data: b'A'
data: b' stands'
data: b' for'
data: b' Open'
data: b' Platform'
data: b' for'
data: b' Enterprise'
data: b' AI'
data: b'.'
```
- Delete the link from the knowledge base:

```
[ec2-user@ip-172-31-77-194 ~]$ # delete link
curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
  -d '{"file_path": "https://opea.dev"}' \
  -H "Content-Type: application/json"
{"detail":"File https://opea.dev not found. Please check file_path."}
```
This currently returns an error; see opea-project/GenAIExamples#724 and opea-project/GenAIExamples#723.
- Download the sample PDF (note the raw URL; the GitHub `blob` page URL would download an HTML page rather than the PDF):

```bash
curl -LO https://raw.githubusercontent.com/opea-project/GenAIComps/main/comps/retrievers/langchain/redis/data/nke-10k-2023.pdf
```
- Update the knowledge base with the PDF:

```bash
curl -X POST "http://${host_ip}:6007/v1/dataprep" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./nke-10k-2023.pdf"
```
- To shut the deployment down, disconnect the TGI service from the Docker network and stop the containers:

```bash
sudo docker network disconnect -f ubuntu_default tgi-service
sudo docker compose down
```
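You can confirm nothing is left running (optional check):

```bash
# Should print only the header line once all containers are down
sudo docker container ls
```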