# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it
make clean
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}"

# Run
echo "Prompt: " \
  && read PROMPT \
  && ./main \
    --threads 8 \
    --n-gpu-layers 1 \
    --model ${MODEL} \
    --color \
    --ctx-size 2048 \
    --temp 0.7 \
    --repeat_penalty 1.1 \
    --n-predict -1 \
    --prompt "[INST] ${PROMPT} [/INST]"
That is just great! Can you point me to more info on how to train that model further, e.g. feed it some extra data like my ebook collection so I can discuss the topics described there? Is there a simple command to pass a PDF or txt file? I'd like to discuss avant-garde cinema and some media art theory :)
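A rough sketch, not fine-tuning: passing a document this way only stuffs its text into the prompt (anything beyond --ctx-size tokens gets cut off), but llama.cpp's main can read a prompt from a file with --file, and pdftotext from poppler converts a PDF to text first. The ebook filenames below are placeholders:
# Convert a PDF to plain text (pdftotext ships with poppler; brew install poppler)
pdftotext my-ebook.pdf my-ebook.txt
# Feed the text file as the prompt; only the first --ctx-size tokens fit
./main --model ${MODEL} --ctx-size 2048 --n-gpu-layers 1 --threads 8 --file my-ebook.txt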
Cheers for the simple single-line --help and -p "prompt here". I tested -i hoping to get an interactive chat, but it just keeps talking and then prints blank lines.
Still wondering how to run a "chat" mode session and then save the conversation. Will check this page again later.
@enzyme69 try with -i -ins instead of -p
Cheers, adding -i indeed makes it generate words non-stop! I'll check around for a nice chat UI for Llama. I tried GPT4All and simply loaded the model, but the responses seem weird.
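For an interactive back-and-forth plus a saved transcript, a sketch building on the -i -ins suggestion above (untested here; the transcript filename is a placeholder):
# -i/--interactive with -ins/--instruct waits for your input between turns;
# tee keeps a copy of the whole session in chat-session.txt
./main --model ${MODEL} --ctx-size 2048 --n-gpu-layers 1 --color -i -ins 2>&1 | tee chat-session.txt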
How many tokens/sec are you all getting and what's your Mac CPU and Ram?
macOS M2, 32 GB. Not sure how the tokens/sec number actually works. Does it vary depending on the prompt?
@enzyme69 Maybe try the following command instead:
./server -m llama-2-13b-chat.ggmlv3.q4_0.bin --ctx-size 2048 --threads 10 --n-gpu-layers 1
and then go to localhost:8080
Thanks @zhedasuiyuan this chat mode is what I was looking for.
Could I humbly suggest expanding all of the command-line args to their full versions in your script? Much easier to grok as a newcomer! Thanks for posting this.
echo "Prompt: " \
&& read PROMPT \
&& ./main \
--threads 8 \
--n-gpu-layers 1 \
--model ${MODEL} \
--color \
--ctx-size 2048 \
--temp 0.7 \
--repeat_penalty 1.1 \
--n-predict -1 \
--prompt "[INST] ${PROMPT} [/INST]"
And for those curious, ./main --help is... helpful!
@reustle Good idea! Updated, thanks
Running a MacBook Pro M2 with 32 GB, and I wish to ask about entities in a news article from the following page:
I am using the following lines in this gist script:
export MODEL=llama-2-13b.ggmlv3.q4_0.bin
wget "https://huggingface.co/TheBloke/Llama-2-13B-GGML/resolve/main/${MODEL}"
Is this the right way to use the 13B non-chat model? It seems to work but hallucinates quite a lot.
https://github.com/ggerganov/llama.cpp/blob/master/examples/llama2.sh is a good example script added recently.
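As a sketch for the base (non-chat) 13B file downloaded above: it is a plain completion model, so it usually behaves better without the [INST] chat template and with a lower temperature; the prompt below is just an illustration:
# Base model: plain completion prompt, no [INST] tags, lower temperature
./main --model llama-2-13b.ggmlv3.q4_0.bin --ctx-size 2048 --n-gpu-layers 1 --threads 8 \
  --temp 0.2 --n-predict 256 \
  --prompt "Article: <paste the news article here>. The named entities mentioned in this article are:"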
This is great.
Any suggestions to serve this as an API endpoint locally and then use it with a chat-ui ?
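One option, sketched rather than tested here: the ./server command shown earlier already exposes an HTTP API on port 8080 alongside its built-in chat UI; its /completion endpoint accepts a JSON body with prompt and n_predict fields, so a chat UI (or curl) can talk to it directly:
# Start the API server (same command as above)
./server -m llama-2-13b-chat.ggmlv3.q4_0.bin --ctx-size 2048 --threads 10 --n-gpu-layers 1
# From another terminal, query the completion endpoint
curl --request POST http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "[INST] Hello, who are you? [/INST]", "n_predict": 128}'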
I don't think llama is using the GPU.
Ran the step:
LLAMA_METAL=1 make
then
make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "Tell me something about Barack Obama" -n 512 -ngl 1
but Activity Monitor shows only the CPU being used.
This is the output when I run LLAMA_METAL=1 make again:
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: Nothing to be done for `default'.
Please advise.
@AmoghM Try make clean && LLAMA_METAL=1 make and then run ./main ... again.
@adrienbrault Thanks, that worked!
Nice work! It can also be used by simply calling bash examples/chat-13B.sh as the last step.
Besides, is there a way to download the 70B model and 70B-chat model? Thanks!
Yes, "The Bloke" published them on hugging face: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
I recommend not downloading via the browser; use JDownloader or something similar instead. Command-line tools may be even better for files of that size.
Here it's using wget on the command line, not the Hugging Face browser UI, so it's all good, right? Or did I miss your point?
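For reference, a wget sketch for the 70B chat GGML; the exact quantization filename is an assumption, so check the file list on the model page:
export MODEL=llama-2-70b-chat.ggmlv3.q4_0.bin   # filename assumed; verify on the model page
wget "https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML/resolve/main/${MODEL}"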
Everyone, my need is to generate embeddings with Llama 2. The examples/embedding/embedding.cpp example warns about a 2048-token limit:
if (params.n_ctx > 2048) {
    fprintf(stderr, "%s: warning: model might not support context sizes greater than 2048 tokens (%d specified);"
            "expect poor results\n", __func__, params.n_ctx);
}
But Llama 2 has a 4096 context length. After building, we get an embedding binary just like the main binary (and more), so I was not sure if we need to edit that check to 4096.
Any help is really appreciated, thanks.
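A sketch, not a confirmed answer: that check only prints a warning, and the actual context size comes from the runtime parameters, so editing the source shouldn't be necessary; passing --ctx-size 4096 to the embedding binary should be enough (model filename reused from the gist):
# The quoted check is only a warning; n_ctx is whatever --ctx-size sets at runtime
./embedding -m llama-2-13b-chat.ggmlv3.q4_0.bin --ctx-size 4096 -p "Some text to embed"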
Getting the following error loading model:
main: build = 1154 (3358c38)
main: seed = 1693681287
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from llama-2-13b-chat.ggmlv3.q4_0.bin
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-13b-chat.ggmlv3.q4_0.bin'
Does anyone know how to fix this?
Same issue :(
Same issue here also! Did something change? I'm a noob, so no idea what a magic number is.
There's a similar error reported in the Python bindings for llama.cpp. Sounds like we need to wait until the new model format is supported there. In the meantime, a temporary workaround is to check out an older release of llama.cpp, for example:
git checkout 1aa18ef
which is the release from Jul 25.
Then run the build again.
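Putting that together with the build steps from the gist, the full workaround looks roughly like this (commit hash taken from the comment above):
cd llama.cpp
git checkout 1aa18ef    # pin to the Jul 25 release that still reads GGML files
make clean
LLAMA_METAL=1 make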
Thanks for above.
I was running into an error:
error loading model: failed to open --color: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '--color'
main: error: unable to load model
Deleted everything and then ran:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git reset --hard 1aa18ef
Then ran the rest of gist and it worked again.
Yeah, the latest llama.cpp is no longer compatible with GGML models. The new model format, GGUF, was merged recently. As far as llama.cpp is concerned, GGML is now dead.
https://huggingface.co/TheBloke/vicuna-13B-v1.5-16K-GGML/discussions/6#64e5ba63a9a5eabaa6fd4a04
Replacing the GGML model with a GGUF model
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q8_0.gguf
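To fetch that from the command line, the sketch below swaps blob for resolve in the URL, which is the usual Hugging Face direct-download path:
wget "https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q8_0.gguf" -P models/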
You can check if it works:
PROMPT> ./main -m models/llama-2-7b-chat.Q8_0.gguf --random-prompt
snip lots of info
response to the prompt
After years of hard work and dedication, a high school teacher in Texas has been recognized for her outstanding contributions to education.
Ms. Rodriguez, a mathematics teacher at...
Does anybody know how to adjust the prompt input to include multiple lines of input before submitting the prompt?
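Two ways this could work with the gist's script, sketched with placeholder text and filenames: read stops at the first newline, so either build the prompt with a heredoc or keep it in a file and pass --file.
# Option 1: heredoc into a variable, then pass it as the prompt
PROMPT=$(cat <<'EOF'
First line of the prompt.
Second line of the prompt.
EOF
)
./main --model ${MODEL} --ctx-size 2048 --n-gpu-layers 1 --prompt "[INST] ${PROMPT} [/INST]"
# Option 2: keep the prompt in a text file and pass it with --file
./main --model ${MODEL} --ctx-size 2048 --n-gpu-layers 1 --file prompt.txt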
Seems to have worked once but now continues to fail. Any ideas why @smart-patrol ?
Prompt:
How large is the sun?
main: build = 904 (1aa18ef)
main: seed = 1700587479
error loading model: failed to open --color: No such file or directory
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '--color'
main: error: unable to load model
same issue
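One guess, not confirmed in this thread: main treating --color as the model path suggests ${MODEL} expanded to an empty string (for example in a new shell where the export was never run), so --model swallowed the next flag. A guard like this makes the script fail loudly instead:
# Abort with a clear message if MODEL is unset or empty (bash parameter expansion)
./main --threads 8 --n-gpu-layers 1 \
  --model "${MODEL:?set MODEL to the downloaded .bin file first}" \
  --color --ctx-size 2048 --temp 0.7 --repeat_penalty 1.1 --n-predict -1 \
  --prompt "[INST] ${PROMPT} [/INST]"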
@gengwg Does this look right for text-generation-webui for MacBookAir 2020 M1:
python3 server.py --listen --trust-remote-code --cpu-memory 8 --gpu-memory 8 --extensions openai --loader llamacpp --model TheBloke_Llama-2-13B-chat-GGML --notebook
I asked it where Atlanta is, and it's very, very slow. At 290 seconds, it has responded with this so far:
Question: Where is Atlanta?
Factual answer: Atlanta is located in the state of Georgia, United States.
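As a sanity check (a sketch, not tested on an M1 Air): running the same GGML file through llama.cpp's Metal build directly would show whether the slowness comes from the webui configuration or from the hardware:
# Same question against the same 13B chat GGML, via llama.cpp's Metal build
./main --threads 4 --n-gpu-layers 1 --model llama-2-13b-chat.ggmlv3.q4_0.bin \
  --ctx-size 2048 --temp 0.7 --n-predict 128 \
  --prompt "[INST] Where is Atlanta? [/INST]"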