Vicki Boykis veekaybee

Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology, "instruction fine-tuning", learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case for RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

LLM Samplers Explained

Every time a large language model makes a prediction, each of the thousands of tokens in its vocabulary is assigned some probability, from almost 0% to almost 100%. There are different ways to choose a token from those predictions. This process is known as "sampling", and there are various strategies you can use, which I will cover here.
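To make the idea of "choosing from the predictions" concrete, here is a minimal sketch that contrasts always picking the most likely token with drawing a token at random in proportion to its probability. The four-token vocabulary, its probabilities, and the helper names `greedy` and `sample` are made up for illustration; a real model scores thousands of tokens.

```python
import random

# Hypothetical next-token distribution (illustrative values only).
probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}


def greedy(probs):
    """Always pick the single most probable token."""
    return max(probs, key=probs.get)


def sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so repeated runs give the same draws
print("greedy: ", greedy(probs))
print("sampled:", [sample(probs, rng) for _ in range(5)])
```

Greedy selection returns the same token every time, while weighted sampling can return less likely tokens now and then. The strategies below are different ways of reshaping the distribution before that draw happens.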

OpenAI Samplers

Temperature

Temperature is a way to control the overall confidence of the model's scores (the logits). With a value lower than 1.0, the relative gaps between the token scores become larger, so the top tokens dominate and the output is more deterministic. With a value higher than 1.0, the gaps become smaller, so probability spreads across more tokens and the output is less deterministic.
A temperature of 1.0 leaves the scores unchanged, so it yields the original distribution that the model was trained to optimize for.
Graph demonstration with voiceover: https://files.catbox.moe/6ht56x.mp4
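As a rough sketch of the effect described above, temperature is commonly applied by dividing the logits by the temperature before the softmax. The logits below are made-up values, and `softmax_with_temperature` is a hypothetical helper written for this example, not any particular library's API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

In this sketch the top token's probability is highest at temperature 0.5 and lowest at 2.0, while the ranking of the tokens never changes. Only the gaps between them grow or shrink.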

	id: test
	namespace: dev
	tasks:
	  - id: deploy
	    type: io.kestra.core.tasks.flows.Worker
	    tasks:
	      - id: cloneRepository
	        type: io.kestra.plugin.git.Clone
	        url: https://github.com/veekaybee/viberary
	        branch: main

	<#meta#>
	- Date: 2023-10-05
	- Task: summary
	<#system#>
	Your main objective is to condense the content of the document into a concise summary, capturing the main points and themes.
	<#chat#>
	<#user#>
	Please read the provided Original section to understand the context and content. Use this understanding to generate a summary of the Original section, incorporating relevant details and maintaining coherence with the Prior Summary.

	Notes:

	import inspect
	import json
	import re
	import typing
	from inspect import isclass, getdoc
	from types import NoneType

	from pydantic import BaseModel, Field
	from pydantic.fields import FieldInfo
	from typing import Any, Type, List, get_args, get_origin, Tuple, Union, Optional

	from typing import List, Dict, Literal, Union
	from transformers import AutoTokenizer


	class MistralAICtx:

	    def __init__(self, model_name: str):
	        assert "mistral" in model_name, "MistralCtx only available for Mistral models"
	        self.tokenizer = AutoTokenizer.from_pretrained(
	            "mistralai/Mistral-7B-Instruct-v0.2")

	#!/bin/bash

	# Get the git diff and save it to a temporary file
	git diff --cached > /tmp/git_diff.txt

	# If there's no diff, exit
	if [ ! -s /tmp/git_diff.txt ]; then
	    echo "No staged changes to commit"
	    exit 1
	fi