Bairen Yi (byronyi)
Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (reinforcement learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language model terminology, "instruction fine-tuning": learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case for RL training, but also requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much

git bisect start
# good: [9b302513f6d82f0ca989b3bb1f5ffc592ed866b7] [AArch64] Add missing intrinsics for vrnd
git bisect good 9b302513f6d82f0ca989b3bb1f5ffc592ed866b7
# bad: [c9ff39a3f9840c84453f23a37386a3dc374f055a] Add "assert require" for the test added in df9158c9a45a6902c2b0394f9bd6512e3e441f31
git bisect bad c9ff39a3f9840c84453f23a37386a3dc374f055a
# good: [002dd47bdd674fad8186650f07458b1e062545df] [clang] Fix typos in the default logic for CLANG_DEFAULT_UNWINDLIB
git bisect good 002dd47bdd674fad8186650f07458b1e062545df
# bad: [bb6732cf622522f17dad948279ba4f68e3bd55e1] [MC] Add parseEOL() overload and migrate some parseToken(AsmToken::EndOfStatement) to parseEOL()
git bisect bad bb6732cf622522f17dad948279ba4f68e3bd55e1
# good: [06a8a867d1591cfdab65037eabd7e865113dc7a6] [rs4gc/tests] Remove use of internal debug flags
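The log above records manual `git bisect good`/`git bisect bad` marking. When a scripted check can decide good vs. bad, `git bisect run` automates the whole search. A self-contained toy sketch (throwaway repo, hypothetical "regression" in commit 4) showing the pattern:

```shell
# Toy demonstration of `git bisect run`: build a 5-commit repo where the
# regression lands in commit 4, then let git find it automatically.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4 5; do
    echo "$i" > value
    git add value
    git commit -qm "commit $i"
done
first=$(git rev-list --max-parents=0 HEAD)   # oldest commit is known good
git bisect start HEAD "$first"
# The test command exits 0 (good) while value < 4, non-zero (bad) after,
# standing in for a real build-and-test script.
out=$(git bisect run sh -c 'test "$(cat value)" -lt 4' 2>&1)
echo "$out" | grep "is the first bad commit"
git bisect reset
```

In a real LLVM bisect like the one above, the command passed to `git bisect run` would rebuild the compiler and run the failing test, returning 0 on success.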
#define EIGEN_USE_THREADS
#include "tensorflow/core/common_runtime/optimization_registry.h"
#include "tensorflow/core/framework/common_shape_fns.h"
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/rendezvous.h"
#include "tensorflow/core/framework/shape_inference.h"
#include "tensorflow/core/graph/algorithm.h"
#include "tensorflow/core/graph/edgeset.h"

This is a test gist

Hello gist!

@byronyi
byronyi / compile_tensorflow_serving.sh
Created October 7, 2017 01:42 — forked from jorgemf/compile_tensorflow_serving.sh
Compile TensorFlow Serving with CUDA support (October 2017)
#!/bin/bash
TENSORFLOW_COMMIT=9e76bf324f6bac63137a02bb6e6ec9120703ea9b # August 16, 2017
TENSORFLOW_SERVING_COMMIT=267d682bf43df1c8e87332d3712c411baf162fe9 # August 18, 2017
MODELS_COMMIT=78007443138108abf5170b296b4d703b49454487 # July 25, 2017
if [ -z "$TENSORFLOW_SERVING_REPO_PATH" ]; then
    TENSORFLOW_SERVING_REPO_PATH="serving"
fi
INITIAL_PATH=$(pwd)
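When testing whether `TENSORFLOW_SERVING_REPO_PATH` is set, the expansion inside `[ -z ... ]` should be quoted; unquoted, a value containing spaces splits into several words and the test errors out. A minimal sketch of the safe pattern (the path with spaces is a hypothetical example value):

```shell
# Quoted -z test: works even when the value contains spaces.
TENSORFLOW_SERVING_REPO_PATH="my serving dir"   # hypothetical value with spaces
if [ -z "$TENSORFLOW_SERVING_REPO_PATH" ]; then
    TENSORFLOW_SERVING_REPO_PATH="serving"      # fall back to the default
fi
echo "using repo path: $TENSORFLOW_SERVING_REPO_PATH"
```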
@byronyi
byronyi / ubuntu-vm.xml
Last active September 20, 2017 12:39 — forked from calerogers/ubuntu-vm.xml
<domain type='kvm'>
  <name>ubuntu-4b</name>
  <uuid>7dfbcb8a-77da-11e6-a116-408d5cb4b9e6</uuid>
  <memory unit='KiB'>12582912</memory>
  <currentMemory unit='KiB'>12582912</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <os>
    <type arch='x86_64' machine='pc-q35-2.5'>hvm</type>
    <loader readonly='no' type='pflash'>/usr/share/OVMF/OVMF_CODE.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/ubuntu-4b_VARS.fd</nvram>
2017-08-03 00:02:21.913337: I tensorflow/core/distributed_runtime/rpc/grpc_remote_worker.cc:176] done callback, req: step_id: 102800540342059661
rendezvous_key: "/job:ps/replica:0/task:0/cpu:0;57cb0fff5df077fe;/job:worker/replica:0/task:0/cpu:0;edge_68_report_uninitialized_variables/boolean_mask/Gather;0:0"
dma_ok: true
response send_start_micros: 1501689741913046
2017-08-03 00:02:21.913389: I tensorflow/core/common_runtime/executor.cc:1611] 0x7fff4c108eb0 Async kernel done: report_uninitialized_variables/boolean_mask/Gather_S15 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/cpu:0", send_device="/job:ps/replica:0/task:0/cpu:0", send_device_incarnation=6326167691039111166, tensor_name="edge_68_report_uninitialized_variables/boolean_mask/Gather", tensor_type=DT_STRING, _device="/job:worker/replica:0/task:0/cpu:0"]()
2017-08-03 00:02:21.913411: I tensorflow/core/framework/log_memory.cc:35] __LOG_MEMORY__ MemoryLogTensorOutput { step_id: 4 kernel_name: "report_uninitialized_variables
2017-08-02 23:42:44.550529: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 256B
2017-08-02 23:42:44.550587: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 512B
2017-08-02 23:42:44.550607: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 1.0KiB
2017-08-02 23:42:44.550614: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 2.0KiB
2017-08-02 23:42:44.550620: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 4.0KiB
2017-08-02 23:42:44.550627: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 8.0KiB
2017-08-02 23:42:44.550634: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 16.0KiB
2017-08-02 23:42:44.550641: I tensorflow/core/common_runtime/bfc_allocator.cc:56] Creating bin of max chunk size 32.0KiB
2017-08-02 23:42:44.550647: I tensorflow/core/common_runtime/bfc_a
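The BFC allocator log above shows its bins growing by powers of two from a 256-byte minimum (256B, 512B, 1.0KiB, 2.0KiB, ...). A quick sketch recomputing the first eight bin sizes:

```shell
# Reproduce the bin sizes from the BFC allocator log: each bin's max
# chunk size doubles, starting at 256 bytes.
size=256
for bin in 0 1 2 3 4 5 6 7; do
    echo "bin $bin: max chunk size ${size}B"
    size=$((size * 2))
done
```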
package(default_visibility = ["//visibility:public"])

licenses(["notice"])  # OpenIB.org BSD license (MIT variant)

exports_files(["COPYING"])

cc_library(
    name = "rdmacm",
    hdrs = [
        "include/rdma/rdma_cma.h",

package(default_visibility = ["//visibility:public"])

licenses(["notice"])  # OpenIB.org BSD license (MIT variant)

exports_files(["COPYING"])

cc_library(
    name = "ibverbs",
    hdrs = [
        "include/infiniband/sa.h",