- Launch an Ubuntu 24.04 `m7i.4xlarge` instance (16 vCPU, 64 GB memory). Change storage to 500 GB.
- Install Docker:
```bash
# Add Docker's official GPG key:
sudo apt-get -y update
sudo apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get -y update

sudo apt-get -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
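Optionally verify the installation with Docker's hello-world image (a quick sanity check, not part of the original steps):

```bash
# Prints a greeting if the Docker daemon is installed and running
sudo docker run --rm hello-world
```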
- Using the AWS Console, launch an Amazon Linux 2023 `m7i.4xlarge` instance (16 vCPU, 64 GB memory). Change storage to 500 GB.
- Install Docker:
```bash
sudo yum update -y
sudo yum install -y docker
sudo service docker start
# Optional: allow running Docker without sudo (log out and back in to take effect)
#sudo usermod -a -G docker ec2-user
```
- Install Docker Compose:
```bash
sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```
- Pull OPEA Docker images:
```bash
sudo docker pull opea/chatqna:latest
sudo docker pull opea/chatqna-conversation-ui:latest
```
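Optionally confirm both images are present before continuing:

```bash
# Lists the OPEA images pulled above
sudo docker images | grep opea
```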
- Replace the Hugging Face API token and the private IP address of the host below, then save the contents in a file named `.env`:

```bash
host_ip=172.31.37.13 # private IP address of the host
no_proxy=${host_ip}
HUGGINGFACEHUB_API_TOKEN="Your_Huggingface_API_Token"
EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
RERANK_MODEL_ID="BAAI/bge-reranker-base"
LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
TEI_EMBEDDING_ENDPOINT="http://${host_ip}:6006"
TEI_RERANKING_ENDPOINT="http://${host_ip}:8808"
TGI_LLM_ENDPOINT="http://${host_ip}:9009"
REDIS_URL="redis://${host_ip}:6379"
INDEX_NAME="rag-redis"
REDIS_HOST=${host_ip}
MEGA_SERVICE_HOST_IP=${host_ip}
EMBEDDING_SERVICE_HOST_IP=${host_ip}
RETRIEVER_SERVICE_HOST_IP=${host_ip}
RERANK_SERVICE_HOST_IP=${host_ip}
LLM_SERVICE_HOST_IP=${host_ip}
BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/chatqna"
DATAPREP_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/dataprep"
DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_file"
DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_file"
```
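If you prefer not to copy the private IP by hand, you can derive `host_ip` on the instance itself (a small convenience sketch; it assumes the first address reported by `hostname -I` is the instance's private IPv4 address):

```bash
# Derive the host's private IP for use in the .env file's host_ip line
host_ip=$(hostname -I | awk '{print $1}')
echo "host_ip=${host_ip}"
```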
- Download the Docker Compose file:

```bash
curl -O https://raw.githubusercontent.com/opea-project/GenAIExamples/main/ChatQnA/docker_compose/intel/cpu/xeon/compose.yaml
```
- Start the application:
```bash
sudo docker compose -f compose.yaml up -d
```
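While the stack starts, you can follow the combined container logs (optional; press Ctrl-C to stop following):

```bash
sudo docker compose -f compose.yaml logs -f
```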
- Verify the list of containers:
```
ubuntu@ip-172-31-79-111:~$ sudo docker container ls
CONTAINER ID   IMAGE                                                                  COMMAND                  CREATED         STATUS         PORTS                                                                                  NAMES
29f3a466d175   opea/chatqna-ui:latest                                                 "docker-entrypoint.s…"   4 minutes ago   Up 4 minutes   0.0.0.0:5173->5173/tcp, :::5173->5173/tcp                                              chatqna-xeon-ui-server
1020fa2a75c2   opea/chatqna:latest                                                    "python chatqna.py"      4 minutes ago   Up 4 minutes   0.0.0.0:8888->8888/tcp, :::8888->8888/tcp                                              chatqna-xeon-backend-server
02112b28ee54   opea/dataprep-redis:latest                                             "python prepare_doc_…"   4 minutes ago   Up 4 minutes   0.0.0.0:6007->6007/tcp, :::6007->6007/tcp                                              dataprep-redis-server
94aaec2991d6   opea/retriever-redis:latest                                            "python retriever_re…"   4 minutes ago   Up 4 minutes   0.0.0.0:7000->7000/tcp, :::7000->7000/tcp                                              retriever-redis-server
9fb6744ceb24   opea/llm-tgi:latest                                                    "bash entrypoint.sh"     4 minutes ago   Up 4 minutes   0.0.0.0:9000->9000/tcp, :::9000->9000/tcp                                              llm-tgi-server
27576d976a3d   opea/embedding-tei:latest                                              "python embedding_te…"   4 minutes ago   Up 4 minutes   0.0.0.0:6000->6000/tcp, :::6000->6000/tcp                                              embedding-tei-server
3e04371fd54b   opea/reranking-tei:latest                                              "python reranking_te…"   4 minutes ago   Up 4 minutes   0.0.0.0:8000->8000/tcp, :::8000->8000/tcp                                              reranking-tei-xeon-server
62929403a9ed   ghcr.io/huggingface/text-generation-inference:sha-e4201f4-intel-cpu   "text-generation-lau…"   5 minutes ago   Up 4 minutes   0.0.0.0:9009->80/tcp, [::]:9009->80/tcp                                                tgi-service
50208c6bc36c   redis/redis-stack:7.2.0-v9                                             "/entrypoint.sh"         5 minutes ago   Up 4 minutes   0.0.0.0:6379->6379/tcp, :::6379->6379/tcp, 0.0.0.0:8001->8001/tcp, :::8001->8001/tcp   redis-vector-db
2a4158c2dbc8   ghcr.io/huggingface/text-embeddings-inference:cpu-1.5                  "text-embeddings-rou…"   5 minutes ago   Up 4 minutes   0.0.0.0:8808->80/tcp, [::]:8808->80/tcp                                                tei-reranking-server
47a59e0d52de   ghcr.io/huggingface/text-embeddings-inference:cpu-1.5                  "text-embeddings-rou…"   5 minutes ago   Up 4 minutes   0.0.0.0:6006->80/tcp, [::]:6006->80/tcp                                                tei-embedding-server
```
Export the `host_ip` environment variable:

```bash
export host_ip=172.31.37.13
```
Test the TEI embedding service:

```bash
curl ${host_ip}:6006/embed \
  -X POST \
  -d '{"inputs":"What is Deep Learning?"}' \
  -H 'Content-Type: application/json'
```
Answer:

```
[[0.00037115702,-0.06356819,0.0024758505,-0.012360337,0.050739925,0.023380278,0.022216318,0.0008076447,-0.0003412891,
. . .
-0.0067949123,0.022558564,-0.04570635,-0.033072025,0.022725677,0.016026087,-0.02125421,-0.02984927,-0.0049473033]]
```
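The response is a single embedding vector with 768 dimensions, the output size of BAAI/bge-base-en-v1.5. If `jq` is installed (an assumption; it is not set up by the steps above), you can check the dimensionality directly:

```bash
# Count the elements of the first (and only) returned embedding
curl -s ${host_ip}:6006/embed \
  -X POST \
  -d '{"inputs":"What is Deep Learning?"}' \
  -H 'Content-Type: application/json' | jq '.[0] | length'
# Expected output: 768
```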
Test the embedding microservice:

```bash
curl http://${host_ip}:6000/v1/embeddings \
  -X POST \
  -d '{"text":"hello"}' \
  -H 'Content-Type: application/json'
```
Answer:

```
{"id":"b73c50b8a8b535c3af708ebc16b0d9cd","text":"hello","embedding":[0.0007791813,0.042613804,0.020304274,-0.0070378557,0.00020632005,0.020170836,-0.00021343566,0.04560513,-0.04856186,-0.0681003
. . .
027401684,-0.052007433,0.016100302,0.059366036,-0.0044034636],"search_type":"similarity","k":4,"distance_threshold":null,"fetch_k":20,"lambda_mult":0.5,"score_threshold":0.2}
```
Test the retriever microservice (the Python one-liner generates a random 768-dimensional vector to stand in for a real query embedding):

```bash
export your_embedding=$(python3 -c "import random; embedding = [random.uniform(-1, 1) for _ in range(768)]; print(embedding)")
curl http://${host_ip}:7000/v1/retrieval \
  -X POST \
  -d "{\"text\":\"test\",\"embedding\":${your_embedding}}" \
  -H 'Content-Type: application/json'
```
Answer (`retrieved_docs` is empty because nothing has been ingested into the knowledge base yet):

```
{"id":"4e0eb0f1ac507c4fbd8f4c843f705f78","retrieved_docs":[],"initial_query":"test","top_n":1}
```
Test the TEI reranking service:

```bash
curl http://${host_ip}:8808/rerank \
  -X POST \
  -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
  -H 'Content-Type: application/json'
```
Answer:

```
[{"index":1,"score":0.94238955},{"index":0,"score":0.120219156}]
```
Test the reranking microservice:

```bash
curl http://${host_ip}:8000/v1/reranking \
  -X POST \
  -d '{"initial_query":"What is Deep Learning?", "retrieved_docs": [{"text":"Deep Learning is not..."}, {"text":"Deep learning is..."}]}' \
  -H 'Content-Type: application/json'
```
Answer:

```
{"id":"41d35cb9153ceb018d62fd1271194aa5","model":null,"query":"What is Deep Learning?","max_new_tokens":1024,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true,"chat_template":null,"documents":["Deep learning is..."]}
```
- Check the logs:

```bash
sudo docker logs tgi-service
```
It takes about 5 minutes for this service to be ready. Wait until you see log output like this:
```
. . .
2024-09-03T17:28:42.909843Z  INFO text_generation_launcher: Downloaded /data/models--Intel--neural-chat-7b-v3-3/snapshots/bdd31cf498d13782cc7497cba5896996ce429f91/pytorch_model-00002-of-00002.bin in 0:00:25.
2024-09-03T17:28:42.909864Z  INFO text_generation_launcher: Download: [2/2] -- ETA: 0
2024-09-03T17:28:42.909880Z  WARN text_generation_launcher: 🚨🚨BREAKING CHANGE in 2.0🚨🚨: Safetensors conversion is disabled without `--trust-remote-code` because Pickle files are unsafe and can essentially contain remote code execution! Please check for more information here: https://huggingface.co/docs/text-generation-inference/basic_tutorials/safety
2024-09-03T17:28:42.910155Z  WARN text_generation_launcher: No safetensors weights found for model Intel/neural-chat-7b-v3-3 at revision None. Converting PyTorch weights to safetensors.
2024-09-03T17:30:26.694416Z  INFO text_generation_launcher: Convert: [1/2] -- Took: 0:01:43.759912
2024-09-03T17:30:58.726409Z  INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:32.031806
2024-09-03T17:30:59.727506Z  INFO download: text_generation_launcher: Successfully downloaded weights for Intel/neural-chat-7b-v3-3
2024-09-03T17:30:59.727942Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-09-03T17:31:03.179128Z  WARN text_generation_launcher: FBGEMM fp8 kernels are not installed.
2024-09-03T17:31:03.196988Z  INFO text_generation_launcher: Using Attention = False
2024-09-03T17:31:03.197034Z  INFO text_generation_launcher: Using Attention = paged
2024-09-03T17:31:03.251121Z  WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-09-03T17:31:03.410013Z  INFO text_generation_launcher: affinity={0, 1, 2, 3, 4, 5, 6, 7}, membind = {0}
2024-09-03T17:31:06.539109Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-09-03T17:31:06.584728Z  INFO shard-manager: text_generation_launcher: Shard ready in 6.806910922s rank=0
2024-09-03T17:31:06.633462Z  INFO text_generation_launcher: Starting Webserver
2024-09-03T17:31:06.820764Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:90: Warming up model
2024-09-03T17:31:22.384496Z  INFO text_generation_launcher: Cuda Graphs are disabled (CUDA_GRAPHS=None).
2024-09-03T17:31:22.384696Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:102: Setting max batch total tokens to 292528
2024-09-03T17:31:22.384724Z  INFO text_generation_router_v3: backends/v3/src/lib.rs:126: Using backend V3
2024-09-03T17:31:22.384750Z  INFO text_generation_router::server: router/src/server.rs:1651: Using the Hugging Face API
2024-09-03T17:31:22.384775Z  INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-09-03T17:31:22.832789Z  INFO text_generation_router::server: router/src/server.rs:2349: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model Intel/neural-chat-7b-v3-3
2024-09-03T17:31:22.862858Z  INFO text_generation_router::server: router/src/server.rs:1747: Overriding LlamaTokenizer with TemplateProcessing to follow python override defined in https://github.com/huggingface/transformers/blob/4aa17d00690b7f82c95bb2949ea57e22c35b4336/src/transformers/models/llama/tokenization_llama_fast.py#L203-L205
2024-09-03T17:31:22.862918Z  INFO text_generation_router::server: router/src/server.rs:1781: Using config Some(Mistral)
2024-09-03T17:31:22.862944Z  WARN text_generation_router::server: router/src/server.rs:1928: Invalid hostname, defaulting to 0.0.0.0
2024-09-03T17:31:22.868816Z  INFO text_generation_router::server: router/src/server.rs:2311: Connected
```
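Instead of watching the logs, you can poll the service until it answers; a minimal sketch assuming TGI's standard `/health` route:

```bash
# Loop until tgi-service responds with a 2xx status
until curl -sf http://${host_ip}:9009/health > /dev/null; do
  echo "waiting for tgi-service..."
  sleep 10
done
echo "tgi-service is ready"
```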
- Check the TGI service:

```bash
# TGI service
curl http://${host_ip}:9009/generate \
  -X POST \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
  -H 'Content-Type: application/json'
```
with the response:

```
{"generated_text":"\n\nDeep Learning is a subset of Machine Learning based on Artificial Neural Network"}
```
- Check the OpenAI-compatible completions endpoint (served here on port 9009 by the TGI container, as the `system_fingerprint` in the response shows; the same request works against a vLLM backend if you swap one in):

```bash
curl http://${host_ip}:9009/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "prompt": "What is Deep Learning?", "max_tokens": 32, "temperature": 0}'
```
with the response:

```
{"object":"text_completion","id":"","created":1725387779,"model":"Intel/neural-chat-7b-v3-3","system_fingerprint":"2.2.1-dev0-sha-e4201f4-intel-cpu","choices":[{"index":0,"text":"\n\nDeep Learning is a subset of Machine Learning that is concerned with algorithms inspired by the structure and function of the brain. It is a part of Artificial","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":6,"completion_tokens":32,"total_tokens":38}}
```
Test the LLM microservice:

```bash
curl http://${host_ip}:9000/v1/chat/completions \
  -X POST \
  -d '{"query":"What is Deep Learning?","max_new_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \
  -H 'Content-Type: application/json'
```
Answer:

```
data: b'\n'
data: b'\n'
data: b'Deep'
data: b' learning'
data: b' is'
data: b' a'
data: b' subset'
data: b' of'
data: b' machine'
data: b' learning'
data: b' that'
data: b' uses'
data: b' algorithms'
data: b' to'
data: b' learn'
data: b' from'
data: b' data'
data: [DONE]
```
Test the ChatQnA mega-service:

```bash
curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{
     "messages": "What is the revenue of Nike in 2023?"
     }'
```
Answer:

```
data: b'\n'
data: b'\n'
data: b'N'
data: b'ike'
data: b"'"
data: b's'
data: b' revenue'
data: b' for'
. . .
data: b' popularity'
data: b' among'
data: b' consumers'
data: b'.'
data: b'</s>'
data: [DONE]
```
- Ask a question. Before the knowledge base is updated, the model guesses at an unrelated expansion of "OPEA":

```
[ec2-user@ip-172-31-77-194 ~]$ curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{ "messages": "What is OPEA?" }'
data: b'\n'
data: b'\n'
data: b'The'
data: b' Oklahoma'
data: b' Public'
data: b' Em'
data: b'ploy'
data: b'ees'
data: b' Association'
```
- Update the knowledge base:

```
[ec2-user@ip-172-31-77-194 ~]$ curl -X POST "http://${host_ip}:6007/v1/dataprep" \
  -H "Content-Type: multipart/form-data" \
  -F 'link_list=["https://opea.dev"]'
{"status":200,"message":"Data preparation succeeded"}
```
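To inspect what the knowledge base currently holds, you can call the `get_file` endpoint declared in `.env` (a sketch; the exact request and response shape may differ between OPEA releases):

```bash
# Hedged: assumes the dataprep get_file route accepts a bare POST
curl -X POST "http://${host_ip}:6007/v1/dataprep/get_file" \
  -H "Content-Type: application/json"
```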
- Ask the question again. With https://opea.dev ingested, the answer is now grounded in the right source:

```
curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{ "messages": "What is OPEA?" }'
data: b'\n'
data: b'O'
data: b'PE'
data: b'A'
data: b' stands'
data: b' for'
data: b' Open'
data: b' Platform'
data: b' for'
data: b' Enterprise'
data: b' AI'
data: b'.'
```
- Delete the link from the knowledge base:

```
[ec2-user@ip-172-31-77-194 ~]$ # delete link
curl -X POST "http://${host_ip}:6007/v1/dataprep/delete_file" \
  -d '{"file_path": "https://opea.dev"}' \
  -H "Content-Type: application/json"
{"detail":"File https://opea.dev not found. Please check file_path."}
```
This currently returns an error; see opea-project/GenAIExamples#724 and opea-project/GenAIExamples#723.
- Download the sample PDF (note the raw URL; the GitHub `blob` page URL would download an HTML page rather than the PDF):

```bash
curl -LO https://raw.githubusercontent.com/opea-project/GenAIComps/main/comps/retrievers/langchain/redis/data/nke-10k-2023.pdf
```
- Update the knowledge base with the PDF:

```bash
curl -X POST "http://${host_ip}:6007/v1/dataprep" \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./nke-10k-2023.pdf"
```
- To shut the deployment down, disconnect the TGI service from the Docker network and stop the containers:

```bash
sudo docker network disconnect -f ubuntu_default tgi-service
sudo docker compose down
```
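You can confirm nothing is left running (optional check):

```bash
# Should print only the header line once all containers are down
sudo docker container ls
```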