Each of these commands will run an ad hoc HTTP static server in your current (or specified) directory, available at http://localhost:8000. Use this power wisely.
$ python -m SimpleHTTPServer 8000
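On Python 3, SimpleHTTPServer was merged into the http.server module, so the equivalent command is:

$ python3 -m http.server 8000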
Latency Comparison Numbers (~2012)

| Operation                         | Latency             | Notes                        |
| --------------------------------- | ------------------- | ---------------------------- |
| L1 cache reference                | 0.5 ns              |                              |
| Branch mispredict                 | 5 ns                |                              |
| L2 cache reference                | 7 ns                | 14x L1 cache                 |
| Mutex lock/unlock                 | 25 ns               |                              |
| Main memory reference             | 100 ns              | 20x L2 cache, 200x L1 cache  |
| Compress 1K bytes with Zippy      | 3,000 ns (3 us)     |                              |
| Send 1K bytes over 1 Gbps network | 10,000 ns (10 us)   |                              |
| Read 4K randomly from SSD*        | 150,000 ns (150 us) | ~1 GB/sec SSD                |
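To make the ratios concrete, here is a small Python sketch (my own illustration, using the table's values as constants) that computes how many of one operation cost the same as another:

```python
# Back-of-the-envelope helper built from the ~2012 table above.
# Values are nanoseconds and are rough orders of magnitude, not
# measurements of any particular machine.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "branch mispredict": 5,
    "L2 cache reference": 7,
    "mutex lock/unlock": 25,
    "main memory reference": 100,
    "compress 1K bytes with Zippy": 3_000,
    "send 1K bytes over 1 Gbps network": 10_000,
    "read 4K randomly from SSD": 150_000,
}

def ratio(slow: str, fast: str) -> float:
    """How many `fast` operations cost the same as one `slow` operation."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

# One random 4K SSD read costs as much as ~1,500 main memory references.
print(ratio("read 4K randomly from SSD", "main memory reference"))  # 1500.0
```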
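The next snippet is unrelated: the opening of a C++ header that declares a type-erased `any` container and its `any_cast` accessor (it looks like a single-header `std::any` backport; only the forward declarations are shown):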
```cpp
#pragma once

#include <exception>     // exception types for failed casts
#include <memory>        // smart pointers / allocation helpers
#include <typeinfo>      // typeid and std::type_info
#include <type_traits>   // traits used by the implementation

// Forward declarations of the type-erased container and its checked accessor.
class any;
template<class Type> Type any_cast(any&);
```
Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggml-org/llama.cpp#5962
In the meantime, use the largest quantization that fully fits in your GPU's VRAM. If you can comfortably fit Q4_K_S, try a model with more parameters instead.
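As a rough way to judge "fully fits", you can estimate weight size from parameter count and bits per weight. The sketch below is my own illustration; the bits-per-weight figures are approximate community numbers, not official llama.cpp values, and real usage also needs headroom for the KV cache and compute buffers:

```python
# Rough fit check for a GGUF quant: params * bits-per-weight / 8.
# APPROX_BPW values are approximate (assumption), not exact format
# sizes; actual files also carry metadata.
APPROX_BPW = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q4_K_S": 4.6,
}

def weights_gib(params_billions: float, quant: str) -> float:
    """Approximate size of the quantized weights in GiB."""
    bits = params_billions * 1e9 * APPROX_BPW[quant]
    return bits / 8 / 2**30

# A 7B model at Q4_K_S is ~3.7 GiB of weights; leave VRAM headroom
# for the KV cache before concluding that it "fully fits".
print(f"{weights_gib(7, 'Q4_K_S'):.1f} GiB")
```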
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix