Bairen Yi byronyi

Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology, "instruction fine-tuning", learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case of RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

	# Dependencies:
	# portpicker (pip install portpicker)
	# tcmalloc4 (sudo apt-get install google-perftools)
	# TF 0.12
	#
	#
	# Benchmarks on Xeon E5-2630 v3 @ 2.40GHz
	#
	# export LD_PRELOAD=/usr/lib/libtcmalloc.so.4
	# python benchmark_grpc_recv.py --data_mb=128

	<domain type='kvm'>
	<name>ubuntu-4b</name>
	<uuid>7dfbcb8a-77da-11e6-a116-408d5cb4b9e6</uuid>
	<memory unit='KiB'>12582912</memory>
	<currentMemory unit='KiB'>12582912</currentMemory>
	<vcpu placement='static'>2</vcpu>
	<os>
	<type arch='x86_64' machine='pc-q35-2.5'>hvm</type>
	<loader readonly='no' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
	<nvram>/var/lib/libvirt/qemu/nvram/ubuntu-4b_VARS.fd</nvram>

	#!/bin/bash

	TENSORFLOW_COMMIT=9e76bf324f6bac63137a02bb6e6ec9120703ea9b # August 16, 2017
	TENSORFLOW_SERVING_COMMIT=267d682bf43df1c8e87332d3712c411baf162fe9 # August 18, 2017
	MODELS_COMMIT=78007443138108abf5170b296b4d703b49454487 # July 25, 2017

	# Fall back to the default checkout directory when the variable is unset or empty.
	if [ -z "$TENSORFLOW_SERVING_REPO_PATH" ]; then
		TENSORFLOW_SERVING_REPO_PATH="serving"
	fi
	INITIAL_PATH=$(pwd)

	from typing import Tuple
	import gc

	import torch
	import torch.nn.functional as F
	import triton
	import triton.language as tl
	import triton.testing