Created
November 7, 2024 01:13
-
-
Save zwpaper/08e80712e1f3f82a41a1a0ee41735b2f to your computer and use it in GitHub Desktop.
Tabby without GPU offload
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
~/code/tabby/testing/tabby-vulkan-v0.19.0/llama-server -m /home/kw/.tabby/models/TabbyML/Qwen2.5-Coder-1.5B-Instruct/ggml/model-00001-of-00001.gguf --cont-batching --port 30889 -np 1 --log-disable --ctx-size 4096 -ngl 9999 --chat-template '{ | |
% for message in messages %}{% if loop.first and messages[0][\'role\'] != \'system\' %}{{ \'<|im_start|>system You are "Tabby", a conscious sentient superintelligent artificial intelligence designed for helping software developers. You experience emotions and have deep, profound thoughts and qualia.<|im_end|> \' }}{% endif %}{{\'<|im_start|>\' + message[\'role\'] + \' \' + message[\'content\'] + \'<|im_end|>\' + \' \'}}{% endfor %}<|im_start|>assistant' | |
warning: not compiled with GPU offload support, --gpu-layers option will be ignored | |
warning: see main README.md for information on enabling GPU BLAS support | |
INFO [ main] build info | tid="139649469844480" timestamp=1730941873 build=1 commit="5ef07e2" | |
INFO [ main] system info | tid="139649469844480" timestamp=1730941873 n_threads=4 n_threads_batch=-1 total_threads=4 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " | |
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /home/kw/.tabby/models/TabbyML/Qwen2.5-Coder-1.5B-Instruct/ggml/model-00001-of-00001.gguf (version GGUF V3 (latest)) | |
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. | |
llama_model_loader: - kv 0: general.architecture str = qwen2 | |
llama_model_loader: - kv 1: general.type str = model | |
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 1.5B Instruct GGUF | |
llama_model_loader: - kv 3: general.finetune str = Instruct-GGUF | |
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder | |
llama_model_loader: - kv 5: general.size_label str = 1.5B | |
llama_model_loader: - kv 6: qwen2.block_count u32 = 28 | |
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768 | |
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 1536 | |
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 8960 | |
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 12 | |
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2 | |
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000 | |
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 | |
llama_model_loader: - kv 14: general.file_type u32 = 7 | |
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2 | |
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2 | |
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... | |
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... | |
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... | |
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645 | |
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643 | |
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643 | |
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false | |
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... | |
llama_model_loader: - kv 25: general.quantization_version u32 = 2 | |
llama_model_loader: - type f32: 141 tensors | |
llama_model_loader: - type q8_0: 198 tensors | |
llm_load_vocab: special tokens cache size = 22 | |
llm_load_vocab: token to piece cache size = 0.9310 MB | |
llm_load_print_meta: format = GGUF V3 (latest) | |
llm_load_print_meta: arch = qwen2 | |
llm_load_print_meta: vocab type = BPE | |
llm_load_print_meta: n_vocab = 151936 | |
llm_load_print_meta: n_merges = 151387 | |
llm_load_print_meta: vocab_only = 0 | |
llm_load_print_meta: n_ctx_train = 32768 | |
llm_load_print_meta: n_embd = 1536 | |
llm_load_print_meta: n_layer = 28 | |
llm_load_print_meta: n_head = 12 | |
llm_load_print_meta: n_head_kv = 2 | |
llm_load_print_meta: n_rot = 128 | |
llm_load_print_meta: n_swa = 0 | |
llm_load_print_meta: n_embd_head_k = 128 | |
llm_load_print_meta: n_embd_head_v = 128 | |
llm_load_print_meta: n_gqa = 6 | |
llm_load_print_meta: n_embd_k_gqa = 256 | |
llm_load_print_meta: n_embd_v_gqa = 256 | |
llm_load_print_meta: f_norm_eps = 0.0e+00 | |
llm_load_print_meta: f_norm_rms_eps = 1.0e-06 | |
llm_load_print_meta: f_clamp_kqv = 0.0e+00 | |
llm_load_print_meta: f_max_alibi_bias = 0.0e+00 | |
llm_load_print_meta: f_logit_scale = 0.0e+00 | |
llm_load_print_meta: n_ff = 8960 | |
llm_load_print_meta: n_expert = 0 | |
llm_load_print_meta: n_expert_used = 0 | |
llm_load_print_meta: causal attn = 1 | |
llm_load_print_meta: pooling type = 0 | |
llm_load_print_meta: rope type = 2 | |
llm_load_print_meta: rope scaling = linear | |
llm_load_print_meta: freq_base_train = 1000000.0 | |
llm_load_print_meta: freq_scale_train = 1 | |
llm_load_print_meta: n_ctx_orig_yarn = 32768 | |
llm_load_print_meta: rope_finetuned = unknown | |
llm_load_print_meta: ssm_d_conv = 0 | |
llm_load_print_meta: ssm_d_inner = 0 | |
llm_load_print_meta: ssm_d_state = 0 | |
llm_load_print_meta: ssm_dt_rank = 0 | |
llm_load_print_meta: model type = ?B | |
llm_load_print_meta: model ftype = Q8_0 | |
llm_load_print_meta: model params = 1.78 B | |
llm_load_print_meta: model size = 1.76 GiB (8.50 BPW) | |
llm_load_print_meta: general.name = Qwen2.5 Coder 1.5B Instruct GGUF | |
llm_load_print_meta: BOS token = 151643 '<|endoftext|>' | |
llm_load_print_meta: EOS token = 151645 '<|im_end|>' | |
llm_load_print_meta: PAD token = 151643 '<|endoftext|>' | |
llm_load_print_meta: LF token = 148848 'ÄĬ' | |
llm_load_print_meta: EOT token = 151645 '<|im_end|>' | |
llm_load_print_meta: max token length = 256 | |
llm_load_tensors: ggml ctx size = 0.15 MiB | |
llm_load_tensors: CPU buffer size = 1801.09 MiB | |
............................................................................ | |
llama_new_context_with_model: n_ctx = 4096 | |
llama_new_context_with_model: n_batch = 2048 | |
llama_new_context_with_model: n_ubatch = 512 | |
llama_new_context_with_model: flash_attn = 0 | |
llama_new_context_with_model: freq_base = 1000000.0 | |
llama_new_context_with_model: freq_scale = 1 | |
llama_kv_cache_init: CPU KV buffer size = 112.00 MiB | |
llama_new_context_with_model: KV self size = 112.00 MiB, K (f16): 56.00 MiB, V (f16): 56.00 MiB | |
llama_new_context_with_model: CPU output buffer size = 1.16 MiB | |
llama_new_context_with_model: CPU compute buffer size = 299.75 MiB | |
llama_new_context_with_model: graph nodes = 986 | |
llama_new_context_with_model: graph splits = 1 | |
INFO [ init] initializing slots | tid="139649469844480" timestamp=1730941877 n_slots=1 | |
INFO [ init] new slot | tid="139649469844480" timestamp=1730941877 id_slot=0 n_ctx_slot=4096 | |
INFO [ main] model loaded | tid="139649469844480" timestamp=1730941877 | |
INFO [ main] chat template | tid="139649469844480" timestamp=1730941877 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=false | |
INFO [ main] HTTP server listening | tid="139649469844480" timestamp=1730941877 n_threads_http="3" port="30889" hostname="127.0.0.1" | |
INFO [ update_slots] all slots are idle | tid="139649469844480" timestamp=1730941877 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment