- Transformers
- Ollama
- llama.cpp
- ExLlamaV2
- AutoGPTQ
- AutoAWQ
- TensorRT-LLM
docs about inference backends: https://www.bentoml.com/blog/benchmarking-llm-inference-backends
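A minimal sketch of driving one of these backends (Transformers); "gpt2" is just a small placeholder model id, any causal LM from the Hub works the same way:

```python
# Minimal text-generation example with the Transformers pipeline API.
# "gpt2" is a small placeholder model; swap in any causal LM.
from transformers import pipeline

pipe = pipeline("text-generation", model="gpt2")
print(pipe("Local LLMs are", max_new_tokens=20)[0]["generated_text"])
```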
- oobabooga
- Stable Diffusion web UI
- SillyTavern
- LM Studio
- Axolotl
- GPT4All
- Open WebUI
- I've used this one
- Enchanted
- Mac native
- Langchain (TS & Python)
- LlamaIndex (TS & Python)
- ModelFusion (TS)
- Haystack (Python)
- Used by AWS, Nvidia, IBM, Intel
- CrewAI (Python)
- Transformers (Python)
- Made by HuggingFace
- PyTorch
- TensorFlow
- JAX
- vokturz/can-it-run-llm
- nyxkrage/gguf-vram-calculator
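For a rough sense of what these calculators compute, a back-of-envelope sketch: weight memory is parameter count times bits per weight, plus headroom for KV cache and activations (the 20% factor here is an assumption, not a fixed rule):

```python
# Rough VRAM estimate: weights at the quantized bit width
# plus ~20% overhead for KV cache and activations (assumed factor).
def vram_gb(params_billion: float, bits_per_weight: float) -> float:
    weights_gb = params_billion * bits_per_weight / 8  # bits -> bytes, in GB
    return weights_gb * 1.2

print(f"{vram_gb(7, 4.5):.1f} GB")   # 7B at ~4.5 bpw (Q4-ish): ~4.7 GB
print(f"{vram_gb(13, 16):.1f} GB")   # 13B at fp16: ~31 GB
```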
- QLoRA
- For fine-tuning models
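A minimal QLoRA-style sketch using the usual transformers + peft + bitsandbytes stack (the model id, target modules, and hyperparameters are placeholders; exact APIs vary by library version):

```python
# QLoRA = 4-bit quantized base model + trainable LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA dtype
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model id
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # placeholder adapter targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapters train
model.print_trainable_parameters()
```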
- bycloud
- HuggingFace
- Fireship
- Not exclusively about LLMs/AI
- David Ondrej
Models are usually saved in one of these formats:
- GGUF
- It's the successor to GGML
- Tech doc about GGUF (from HuggingFace)
- GGML
- Safetensors
- EXL2
- AWQ
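As a concrete example, a GGUF file can be run locally with llama-cpp-python (the file path is a placeholder):

```python
# Load a quantized GGUF model and run a short completion.
from llama_cpp import Llama

llm = Llama(model_path="./models/mistral-7b-q4_k_m.gguf", n_ctx=4096)
out = llm("Q: What is GGUF? A:", max_tokens=64)
print(out["choices"][0]["text"])
```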
These files contain the context used by the LLMs
1 token ~= 0.75 words
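That rule of thumb is enough for quick context-budget math:

```python
# Back-of-envelope token count from the ~0.75 words/token rule of thumb.
def estimate_tokens(text: str) -> int:
    return round(len(text.split()) / 0.75)

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))  # 12
```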
Common llama.cpp quantization types:
- Q4_0
- Q4_1
- Q5_0
- Q5_1
- Q8_0
- Q3_K_S
- Q3_K_M
- Q3_K_L
- Q4_K_S
- Q4_K_M
- Q5_K_S
- Q5_K_M
- Q6_K
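These suffixes trade file size against quality: fewer bits per weight means a smaller, lossier file. A rough size estimate per quant type (the bits-per-weight figures below are approximate community numbers, not exact):

```python
# Approximate bits-per-weight for some llama.cpp quant types (ballpark).
BPW = {
    "Q3_K_M": 3.9, "Q4_0": 4.5, "Q4_K_M": 4.8,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

def quant_size_gb(params_billion: float, quant: str) -> float:
    return params_billion * BPW[quant] / 8  # bits -> bytes, in GB

print(f"7B @ Q4_K_M ≈ {quant_size_gb(7, 'Q4_K_M'):.1f} GB")  # ~4.2 GB
```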