Pre-requisites: CUDA (check with nvcc --version) and an NVIDIA GPU with a driver installed (check with nvidia-smi).
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1
See also https://stackoverflow.com/a/78165019/429476 regarding g++ errors
Note: What is GGUF? https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/
From https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF, select the model you want to run: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF#provided-files
I am using a medium-sized model:
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| --- | --- | --- | --- | --- | --- |
| mistral-7b-instruct-v0.2.Q4_K_M.gguf | Q4_K_M | 4 | 4.37 GB | 6.87 GB | medium, balanced quality - recommended |
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
cd Mistral-7B-Instruct-v0.2-GGUF
ls
# select and pull whichever model you are interested in
git lfs pull --include mistral-7b-instruct-v0.2.Q4_K_M.gguf
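Optionally, a quick way to confirm the pull actually fetched the weights rather than leaving an LFS pointer file behind (a small sketch; the expected size comes from the table above):

import os

# LFS pointer files are only ~100 bytes; the real Q4_K_M GGUF is ~4.37 GB
path = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
size_gb = os.path.getsize(path) / 1e9
print(f"{path}: {size_gb:.2f} GB")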
From the llama.cpp directory:
./main -m ~/coding/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf --color --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000 --multiline-input
Note that I am using the --multiline-input option.
That's it. Note that it is using about 4.5 GB of GPU RAM (see the ./main process below):
nvidia-smi
Fri Mar 15 14:26:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 ... Off | 00000000:01:00.0 On | N/A |
| N/A 52C P8 13W / 80W | 4932MiB / 6144MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 5901 G /usr/lib/xorg/Xorg 210MiB |
| 0 N/A N/A 6070 G /usr/bin/gnome-shell 124MiB |
| 0 N/A N/A 15150 G ...rker,SpareRendererForSitePerProcess 24MiB |
| 0 N/A N/A 39497 C ./main 4560MiB |
+---------------------------------------------------------------------------------------+
./server -m /home/alex/coding/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf --ctx_size 2048 -n -1 -b 256 -t 8 -ngl 10000 --host localhost --port 8080
The server can be accessed at http://localhost:8080 (the --host and --port values above).
Thanks to this post: https://www.xzh.me/2023/09/serving-llama-2-7b-using-llamacpp-with.html
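As a quick sanity check, you can hit the server's native /completion endpoint from Python; a minimal sketch, assuming the server above is listening on localhost:8080 (the prompt text is just an example):

import requests

# llama.cpp server's native completion endpoint (not OpenAI-compatible)
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "What is quantization in one sentence?",  # example prompt
        "n_predict": 128,   # max tokens to generate
        "temperature": 0.2,
    },
)
resp.raise_for_status()
print(resp.json()["content"])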
The above server's API is not OpenAI-compatible. Since AutoGen needs an OpenAI-compatible endpoint, we will install llama-cpp-python, the Python binding for llama.cpp, whose bundled server is OpenAI-compatible:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python # for cuBLAS (BLAS via CUDA)
I have multiple gcc versions installed and had to pin the compiler as follows to overcome the DSO error:
CC=gcc-12 CXX=g++-12 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
Install the other dependencies it asks for via pip (pydantic, starlette, etc.), or pull them in one go with pip install 'llama-cpp-python[server]'.
python3 -m llama_cpp.server --model /home/alex/coding/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf --host localhost --port 8080
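Before wiring up AutoGen, you can verify that the OpenAI-compatible endpoint works; a minimal sketch, assuming the server above is running on localhost:8080 (the model field is assumed to be free-form here, since the server answers for whichever GGUF it loaded):

import requests

# OpenAI-style chat completions request against llama-cpp-python's server
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mistral-instruct-v0.2",
        "messages": [
            {"role": "user", "content": "Say hello in one short sentence."},
        ],
        "temperature": 0.2,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])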
Install AutoGen (pip install pyautogen).
Write a small AutoGen script:
import autogen

config_list = [
    {
        "model": "mistral-instruct-v0.2",
        "base_url": "http://localhost:8080/v1",
        "api_key": "NULL",
    },
]

llm_config = {
    "cache_seed": 442,  # seed for caching and reproducibility
    "config_list": config_list,  # a list of OpenAI API configurations
    "temperature": 0,  # temperature for sampling
    # "timeout": 600,
}

# create an AssistantAgent named "assistant" backed by the local OpenAI-compatible server
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config=llm_config,
)

# create a UserProxyAgent instance named "user_proxy"
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="TERMINATE",
    max_consecutive_auto_reply=10,
    is_termination_msg=lambda x: x.get("content", "").rstrip().endswith("TERMINATE"),
    code_execution_config={
        "work_dir": "web",
        "use_docker": False,
    },
    llm_config=llm_config,
    system_message="""Reply TERMINATE if the task has been solved at full satisfaction.
Otherwise, reply CONTINUE, or the reason why the task is not solved yet.""",
)

# the assistant receives a message from the user_proxy, which contains the task description
user_proxy.initiate_chat(
    assistant,
    message="Check the web for latest financial news and summarise the chance of US Fed raising interest rates this month",
)
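With the llama-cpp-python server from the previous step still running, run this as a normal Python file (e.g. python autogen_script.py; the filename is arbitrary).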
Output is happening, but AutoGen is not using tools like a web browser; it is relying on the LLM's training data instead (note the stale March 2023 dates in the reply below). Need to check; one thing to try is sketched after the transcript.
assistant (to user_proxy):
According to various financial news outlets, including The Wall Street Journal, Reuters, and Bloomberg, there is a strong likelihood that the US Federal Reserve (Fed) will raise interest rates by 0.25 percentage points at its two-day policy meeting ending on March 16, 2023. This would mark the first rate hike since 2018.
The consensus among economists and market analysts is that the Fed will increase rates in response to rising inflation, which reached 6.0% in February, well above the central bank's 2% target. Additionally, the labor market remains tight, with the unemployment rate at 3.6%, near a 50-year low.
However, some analysts have suggested that the Fed might hold off on a rate hike if inflation data for March comes in weaker than expected. Nevertheless, most experts believe that the Fed will begin its rate hike cycle this month to keep inflation in check and maintain its credibility.
It's important to note that the Fed's decision will depend on various factors, including economic data releases, geopolitical developments, and market conditions. Therefore, while the odds of a rate hike are high, they are not 100%, and the final decision will be announced on March 16.
--------------------------------------------------------------------------------
>>>>>>>> USING AUTO REPLY...
please check the lat
user_proxy (to assistant):
In summary, based on current information from reputable financial news sources, there is a high probability that the US Federal Reserve will raise interest rates by 0.25 percentage points at its upcoming policy meeting on March 16, 2023. This decision is largely due to rising inflation and a tight labor market. However, some analysts suggest that a weaker-than-expected inflation reading in March could cause the Fed to hold off on a rate hike. Ultimately, the final decision will depend on various economic and geopolitical factors.
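One direction to explore (a sketch only, not verified against this setup): explicitly register a tool with the two agents so the assistant can call it rather than answering from training data. The search_web function here is a hypothetical stub, and the tool call will only fire if the backend honors the OpenAI tools API, which llama-cpp-python supports only for certain chat formats:

import autogen

# Hypothetical stub - replace with a real web search implementation
def search_web(query: str) -> str:
    """Search the web and return result snippets for the query."""
    return f"(stub) no results for: {query}"  # placeholder only

# Let the assistant propose calls to the tool and user_proxy execute them
autogen.register_function(
    search_web,
    caller=assistant,     # agent that suggests the tool call
    executor=user_proxy,  # agent that actually runs the function
    description="Search the web for recent news on a topic",
)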