
Steffen Röcker (sroecker)

@tvst
tvst / streamlit_app.py
Last active July 20, 2025 13:52
Simple way to run heavy computations without slowing down other Streamlit users
import streamlit as st
import concurrent.futures  # We'll do computations in separate processes!
import mymodule  # This is where you'll do the computation

# Your st calls must go inside this IF block.
if __name__ == '__main__':
    st.write("Starting a long computation on another process")

    # Pick max number of concurrent processes. Depends on how heavy your
    # computation is, and how powerful your machine is.
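    # A hedged sketch of how this pattern plausibly continues; the pool size,
    # the executor setup, and mymodule.expensive_computation are assumptions,
    # not from the gist preview:
    POOL_SIZE = 4
    with concurrent.futures.ProcessPoolExecutor(max_workers=POOL_SIZE) as executor:
        future = executor.submit(mymodule.expensive_computation, 42)
        st.write("Result:", future.result())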
@Artefact2
Artefact2 / README.md
Last active August 12, 2025 23:33
GGUF quantizations overview
@kalomaze
kalomaze / llm_samplers_explained.md
Last active July 31, 2025 22:23
LLM Samplers Explained

LLM Samplers Explained

Every time a large language model makes a prediction, all of the thousands of tokens in the vocabulary are assigned some degree of probability, from almost 0% to almost 100%. There are different ways you can decide to choose from those predictions. This process is known as "sampling", and there are various strategies you can use, which I will cover here.

OpenAI Samplers

Temperature

  • Temperature is a way to control the overall confidence of the model's scores (the logits). With a value below 1.0, the relative distances between token scores grow larger (more deterministic); with a value above 1.0, they shrink (less deterministic).
  • A temperature of 1.0 leaves the scores unchanged, so it reproduces the original distribution the model was trained to optimize for.
  • Graph demonstration with voiceover: https://files.catbox.moe/6ht56x.mp4
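A minimal sketch of how temperature scaling works in practice, assuming raw logits from the model (the function name and the example logits are illustrative, not from the gist):

import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Dividing by T < 1 widens the gaps between scores (more deterministic);
    # T > 1 narrows them (less deterministic); T = 1 leaves them unchanged.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, shifted for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Illustrative logits for a tiny 4-token vocabulary:
token_id = sample_with_temperature([3.0, 1.5, 0.5, -1.0], temperature=0.7)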
@jrknox1977
jrknox1977 / ollama_dspy.py
Created February 9, 2024 18:06
ollama+DSPy using OpenAI APIs.
# install DSPy: pip install dspy
import dspy

# Ollama is now compatible with OpenAI APIs.
#
# To get this to work you must include `model_type='chat'` in the `dspy.OpenAI` call.
# If you do not include this you will get an error.
#
# I have also found that `stop='\n\n'` is required to get the model to stop
# generating text after the answer is complete. At least with mistral.
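# A minimal sketch (untested) of the call those notes describe; the endpoint
# URL and model name are assumptions, not from the gist preview:
ollama_mistral = dspy.OpenAI(
    api_base='http://localhost:11434/v1/',  # assumed default Ollama endpoint
    api_key='ollama',                       # placeholder; Ollama ignores it
    model='mistral',
    model_type='chat',                      # required, per the note above
    stop='\n\n',                            # stop once the answer is complete
)
dspy.settings.configure(lm=ollama_mistral)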
@bjsi
bjsi / main.py
Created February 4, 2024 10:38
Deploying RAGatouille on Modal Labs
from typing import List, Optional, TypedDict

import modal
from modal import gpu, build, enter, exit, method

class Document(TypedDict):
    content: str
    metadata: dict
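The preview cuts off at the Document type. Below is a minimal sketch of how these Modal imports are typically wired up to serve a ColBERT model via RAGatouille, assuming the modal.Stub API current when the gist was written; the stub name, GPU choice, checkpoint, and method body are assumptions, not from the gist:

stub = modal.Stub("ragatouille-serve")

@stub.cls(gpu=gpu.A10G())
class Searcher:
    @enter()
    def load_model(self):
        # Load the pretrained ColBERT checkpoint once per container start.
        from ragatouille import RAGPretrainedModel
        self.model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

    @method()
    def rerank(self, query: str, documents: List[str], k: int = 5):
        # Score candidate passages against the query with ColBERT.
        return self.model.rerank(query=query, documents=documents, k=k)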
@sayakpaul
sayakpaul / coco_30k_hf_datasets.py
Created January 31, 2024 10:34
Randomly samples 30k examples from the COCO 2014 validation set.
from datasets import Dataset, Features
from datasets import Image as ImageFeature
from datasets import Value
import pandas as pd
import os
# CSV comes from the notebook above.
df = pd.read_csv("coco_30k_randomly_sampled_2014_val.csv")
root_path = "val2014"
@virattt
virattt / rag-reranking-gpt-colbert-mistral.ipynb
Last active March 12, 2025 10:09
@algal
algal / p.sh
Created January 22, 2024 17:31
bash script to query perplexity.ai
#!/usr/bin/env bash
# based off of https://gist.github.com/rauchg/c5f0b1dc245ad95c593de8336aa382ac?permalink_comment_id=4842642#gistcomment-4842642
if [ "$#" -eq 0 ]; then
    echo "Usage: $(basename "$0") prompt_to_send_to_perplexity"
    echo ""
    echo "  Requirements: PERPLEXITY_API, defined; jq and curl, installed; bash, version 3 or higher."
    exit 1
fi
function p() {
@rauchg
rauchg / p.sh
Last active July 16, 2025 11:22
Perplexity CLI in pure shell
#!/usr/bin/env bash
function p() {
    jq -n \
        --arg content "$*" \
        '{
            "model": "pplx-7b-online",
            "messages": [
                {
                    "role": "system",
                    "content": "Be precise and concise."
@veekaybee
veekaybee / normcore-llm.md
Last active August 12, 2025 11:14
Normcore LLM Reads

Anti-hype LLM reading list

Goals: Add links that are reasonable and good explanations of how stuff works. No hype and no vendor content if possible. Practical first-hand accounts of models in prod eagerly sought.

Foundational Concepts


Pre-Transformer Models