kalomaze

the generic basics of preference reward modeling

The Bradley-Terry model works like this:

It's based on a chosen/rejected split
The model is trained on binary judgements of specific content/samples as being either 'preferred' or 'dispreferred'
The log ratio between preferred and dispreferred can be used as the natural reward signal

What the original function actually looked like:

f32 atan2(f32 startX, f32 startZ, f32 endX, f32 endZ) {
    f32 xDiff = endX - startX;
    f32 zDiff = endZ - startZ;
    f32 absXDiff = fabsf(xDiff);
    f32 absZDiff = fabsf(zDiff);
    f32 ret;

Original Text ("Reading a Sign 43 Times Heals Your Axe Durability" by Hunter R.)

We all know that the axe in Animal Crossing will usually break after using it too much. Of course, the axe is intentionally designed to break like this in order to make the unbreakable Golden Axe an appealing item to unlock. And yet what if I told you that by simply reading a sign over and over you can actually prevent your standard axe from ever breaking? And no, I'm not joking—you can actually sit here and read this sign over and over to heal the durability on your axe, making it theoretically invincible. I'm sure a lot of you are wondering how or why this even works, so let's take a closer look.

Creating an unbreakable axe is a really funny glitch that was recently discovered by Animal Crossing spreadsheet owner Phil. To understand how interacting with a sign heals your axe, let's discuss how axe durability works.

Normally an axe can withstand 72 hits on normal trees before breaking. Since trees take three hits to cut

	class IntermediateSequenceAverageLoss(torch.nn.Module):
	"""
	Custom loss function that calculates the average of intermediate sequence averages,
	with proper token shifting for causal language modeling.

	For a sequence of length n, this calculates:
	1. Average loss of token 1 predicting token 2
	2. Average loss of tokens 1-2 predicting tokens 2-3
	3. Average loss of tokens 1-3 predicting tokens 2-4
	...

	# the "verifiers" repository is a clean implementation of templated GRPO reinforcement learning training environments
	# this is a generic set of "install from scratch" commands complete with a deepspeed z3 config that i have been using when i spin up nodes
	# it will run on the gsm8k example w/ default batch size & generation size (8), and the 8th GPU is used for vllm generations
	# qwen 14b full finetuning will run on this configuration too without LoRA or CUDA OOM, at least for the gsm8k task's context sizes + generation lengths
	# hyperparameters are controlled by `verifiers/utils/config_utils.py`; i have been preferring extreme grad clipping (between 0.001 and 0.01) and low beta (under 0.01)

	# NOTE FEB 27: examples have moved into `verifiers/examples` not `/examples`

	cd /root
	mkdir boom

	class RescaleDescentTrainer(Trainer):
	def __init__(self, args, *kwargs):
	super().__init__(args, *kwargs)
	# Initialize all buffers
	self.tokens_buffer = [] # for raw token loss
	self.weighted_tokens_buffer = [] # for entropy weighted token loss
	self.unigram_rate_buffer = []
	self.bigram_rate_buffer = []
	self.trigram_rate_buffer = []
	self.weighted_unigram_buffer = []

	from transformers import AutoTokenizer
	from huggingface_hub import snapshot_download
	import os

	def add_token_boundaries(tokenizer, tokens):
	"""Add brackets around token boundaries"""
	text = ""
	for token in tokens:
	decoded = tokenizer.decode([token])
	text += f"[{decoded}] "

	import sys
	import random
	import numpy as np
	import string
	from datetime import datetime
	from PIL import Image, ImageEnhance, ImageOps
	from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget, QVBoxLayout,
	QHBoxLayout, QTextEdit, QPushButton, QCheckBox,
	QLabel, QSpinBox, QComboBox, QSlider, QFileDialog,
	QFrame)

	how long until 450 steps exactly?

	69%\|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ \| 429/625 [12:10:27<6:29:27, 119.22s/it]

	import React, { useState } from 'react';
	import { Settings, Bookmark, Download, Library, HelpCircle, RefreshCw, ArrowLeft } from 'lucide-react';

	const STORY_BRANCHES = {
	root: {
	text: `The darkness grew absolute, not that the hyperstitioner could see in the first place. His ears pricked up, however; he could hear the skittering, the mechanical hum as the machine followed him invisibly...`,
	continuations: [
	{
	id: 'a1',
	text: " The mechanical tendrils wrapped tighter around his shoulder, its grip a cold reminder of their symbiosis...",