Start the JVM with -XX:NativeMemoryTracking=detail to enable native memory tracking.
Find the JVM's process ID with jps, then check its OS-level footprint:
ps -p <PID> -o pcpu,rss,size,vsize
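Once tracking is enabled, the NMT report itself is read with jcmd, the standard JDK diagnostic tool:

jcmd <PID> VM.native_memory detail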
@Grapes([
    @Grab(group='org.gperfutils', module='gbench', version='[0.4,)'),
    @Grab(group='org.apache.commons', module='commons-lang3', version='[3.7,)'),
    @Grab(group='joda-time', module='joda-time', version='[2.9.9,)')
])
/**
 * Java 8 DateTimeFormatter
 */
import java.util.Date
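The listing breaks off after the first import. Below is a minimal sketch of how the benchmark body might continue, assuming gbench's BenchmarkBuilder; the pattern string, labels, and variable names are illustrative, not from the original:

import groovyx.gbench.BenchmarkBuilder
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import org.apache.commons.lang3.time.FastDateFormat
import org.joda.time.format.DateTimeFormat

def pattern = 'yyyy-MM-dd HH:mm:ss'            // illustrative pattern, not from the original
def now     = new Date()
def jdk8    = DateTimeFormatter.ofPattern(pattern)
def commons = FastDateFormat.getInstance(pattern)
def joda    = DateTimeFormat.forPattern(pattern)

// gbench times each labeled closure and prints the results side by side
new BenchmarkBuilder().run {
    'java.time.DateTimeFormatter'  { jdk8.format(ZonedDateTime.now()) }
    'commons-lang3 FastDateFormat' { commons.format(now) }
    'joda-time DateTimeFormat'     { joda.print(now.time) }
}.prettyPrint()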
Good question! I am collecting human data on how quantization affects outputs. See here for more information: ggml-org/llama.cpp#5962
In the meantime, use the largest quantization that fully fits in your GPU's VRAM. If Q4_K_S fits comfortably, try a model with more parameters instead.
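A quick way to check whether a given quant fits is to load it with full GPU offload and watch VRAM use. A sketch with llama.cpp's CLI, where the model filename is a placeholder and -ngl 99 asks for all layers to be offloaded to the GPU:

./llama-cli -m model.Q4_K_S.gguf -ngl 99 -p "Hello"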
See the wiki upstream: https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix