Guilherme Pires colobas

Maybe you've heard about this technique but you haven't completely understood it, especially the PPO part. This explanation might help.

We will focus on text-to-text language models 📝, such as GPT-3, BLOOM, and T5. Models like BERT, which are encoder-only, are not addressed.

Reinforcement Learning from Human Feedback (RLHF) was used to train ChatGPT, which explains its recent surge in popularity. 📈

RLHF is especially useful in two scenarios 🌟:

- You can’t create a good loss function.
  - Example: how do you calculate a metric to measure if the model’s output was funny?
- You want to train with production data, but you can’t easily label your production data.
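To make the first scenario a bit more concrete: when no metric for "funny" can be written by hand, one common workaround is to have humans compare two outputs and train a reward model so that the preferred output scores higher. The snippet below is only a minimal sketch of that pairwise preference loss. The function name `preference_loss` and the plain-float rewards are assumptions made for illustration; a real setup would compute the rewards with a neural network.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration. The loss is small when the output
    the human preferred already receives the higher reward.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) == log(1 + exp(-d)); branch on the sign of d so that
    # exp() never receives a large positive argument.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The human preferred output A over output B.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)
```

The loss only needs to know which of two outputs a person preferred, so no absolute "funniness" score is ever required.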

	Latency Comparison Numbers (~2012)
	----------------------------------
	L1 cache reference                       0.5 ns
	Branch mispredict                        5   ns
	L2 cache reference                       7   ns              14x L1 cache
	Mutex lock/unlock                       25   ns
	Main memory reference                  100   ns              20x L2 cache, 200x L1 cache
	Compress 1K bytes with Zippy         3,000   ns      3 us
	Send 1K bytes over 1 Gbps network   10,000   ns     10 us
	Read 4K randomly from SSD*         150,000   ns    150 us    ~1GB/sec SSD

	/* Hacker News Search Script
	*
	* Original Script by Kristopolous:
	* https://gist.github.com/kristopolous/19260ae54967c2219da8
	*
	* Usage:
	* First, copy the script into your browser's console whilst on the Hacker News
	* jobs page. Then, you can use the query function to filter the results.
	*
	* For example,

	from __future__ import print_function
	import torch
	import torch.nn as nn
	import torch.nn.functional as F
	from torch.autograd import Variable

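	def sample_gumbel(shape, eps=1e-20):
	    # Draw Gumbel(0, 1) noise via inverse transform: -log(-log(U)), U ~ Uniform(0, 1).
	    # eps guards against log(0). Note: .cuda() requires a CUDA-capable device.
	    U = torch.rand(shape).cuda()
	    return -Variable(torch.log(-torch.log(U + eps) + eps))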

	// Moved to a github repo for versioning, the file can now be loaded from a CDN
	// <script type="module" src="https://esm.sh/gh/loueed/[email protected]/comments"></script>

	class BskyComments extends HTMLElement {
	  constructor() {
	    super();
	    this.attachShadow({ mode: "open" });
	    this.visibleCount = 3;
	    this.thread = null;
	    this.error = null;