
@tazarov
Created September 20, 2023 14:15
Chroma Batching with Langchain
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"!pip install langchain pdfminer pdf2image pypdf unstructured pdfminer-six\n"
],
"metadata": {
"collapsed": false
},
"id": "1b6024ada83b727"
},
{
"cell_type": "code",
"execution_count": 1,
"id": "initial_id",
"metadata": {
"collapsed": true,
"ExecuteTime": {
"end_time": "2023-09-20T14:12:15.418584Z",
"start_time": "2023-09-20T14:12:07.411754Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SQL: SELECT \"embeddings\".\"id\",\"embeddings\".\"embedding_id\",\"embeddings\".\"seq_id\",\"embedding_metadata\".\"key\",\"embedding_metadata\".\"string_value\",\"embedding_metadata\".\"int_value\",\"embedding_metadata\".\"float_value\",\"embedding_metadata\".\"bool_value\" FROM \"embeddings\" LEFT JOIN \"embedding_metadata\" ON \"embeddings\".\"id\"=\"embedding_metadata\".\"id\" WHERE \"embeddings\".\"segment_id\"=? AND \"embeddings\".\"embedding_id\" IN (?, ?, ?, ?) ORDER BY \"embeddings\".\"id\"\n",
"Execution time: 0.0006 seconds\n",
"[Document(page_content='number of grow steps, 𝐼: number of improve steps, 𝑁: number of samples per context\\n\\nTrain πœ‹πœƒ on D using loss L. for 𝑔 = 1 to 𝐺 do // Grow Generate dataset D𝑔 by sampling: D𝑔 = { (𝒙𝑖, π’š 𝑖)| Annotate D𝑔 with the reward model 𝑅(𝒙, π’š). for 𝑖 = 1 to 𝐼 do // Improve Choose threshold s.t. 𝜏1 > π‘‰πœ‹πœƒ for π‘‰πœ‹πœƒ = 𝔼D𝑔 [𝑅(𝒙, π’š)] and πœπ‘–+1 > πœπ‘–. while reward improves on Dπ‘’π‘£π‘Žπ‘™ do\\n\\nend\\n\\nOptimise πœƒ on objective: 𝐽 (πœƒ) = 𝔼(𝒙,π’š )∼D𝑔 [𝐹(𝒙, π’š; πœπ‘–) L (𝒙, π’š; πœƒ)]\\n\\n𝑁𝑔\\n\\n𝑖=1 s.t. 𝒙𝑖 ∼ D, π’š 𝑖 ∼ πœ‹πœƒ( π’š|𝒙𝑖) } βˆͺ D.\\n\\nend\\n\\nend Output: Policy πœ‹πœƒ\\n\\nProbabilistic interpretation of the Improve step Let us consider the particular choice L = LNLL, with πœƒβ€² being the parameters of the model from the last Grow step, πœ† the proportion of data sampled from this model in D𝑔 and a single step of growth. The expression for the gradient in this case takes the following form:\\n\\nβˆ‡π½ (πœƒ) = βˆ’π”Όπ’™βˆΌD\\n\\n(cid:2)πœ†π”Όπ’šβˆΌπœ‹πœƒβ€² ( π’š | 𝒙) [𝐹(𝒙, π’š; 𝜏)βˆ‡ log πœ‹πœƒ( π’š | 𝒙)] + (1 βˆ’ πœ†)π”Όπ’šβˆΌπ‘( π’š | 𝒙) [𝐹(𝒙, π’š; 𝜏)βˆ‡ log πœ‹πœƒ( π’š | 𝒙)](cid:3) . (3)\\n\\nThe first term on the RHS of (3) is similar to an online policy gradient term at the beginning of training when πœƒ β‰ˆ πœƒβ€² with 𝐹(𝒙, π’š; 𝜏) replacing the state-action value function π‘„πœ‹(𝒙, π’š), when starting in state 𝒙 and taking sequential actions π’š, that is generating synthetic data π’š using policy πœ‹πœƒ in our context. For the second term on the RHS of (3), we consider the original data D, but we still ensure that it passes the threshold 𝐹(𝒙, π’š; 𝜏). Intuitively, people choose D for training according to some possibly unknown criteria. In this work, we make the criterion 𝐹(𝒙, π’š; 𝜏) explicit. The last term is therefore a form of offline policy gradients which prevents πœ‹πœƒ( π’š | 𝒙) to move too far from 𝑝( π’š | 𝒙) which could lead to model collapse (Shumailov et al., 2023). Finally, note the similarity of this approach with self-training (Clark et al., 2003; Scudder, 1965; Xie et al., 2020) techniques. We provide a population interpretation (i.e., as 𝑁, 𝑁𝑔 β†’ ∞) of ReST in Appendix A.9.\\n\\nIn the following section, we explore how the choice of loss, filtering function and threshold, and synthetic data generated by language policy via sampling (exploration data) empirically affect the performance of the resulting policies πœ‹πœƒ.\\n\\n4. Experiments and analysis\\n\\nWe chose machine translation as a testbed for ReST as it is an impactful application of conditional language modeling where established reward models are available, for example, Metric X (Freitag et al., 2022), BLEURT (Sellam et al., 2020) and COMET (Rei et al., 2020). We ran experiments on two common benchmarks: IWSLT 2014 (Cettolo et al., 2014), and WMT 2020 (Koehn et al., 2020),\\n\\n5\\n\\nReinforced Self-Training (ReST) for Language Modeling\\n\\nFigure 3 | ReST with multiple Improve steps. Average reward model scores on IWSLT 2014 De-En, WMT 2020 Zh-En, and Web Domain En-Zh validation sets. On each dataset, we report results with BC (𝐺 = 0, 𝐼 = 0) and ReST with a single Grow step and several Improve steps with an increasing reward threshold. Each Improve step increases the reward model score in all three validation datasets. 
We found the suitable number of Improve steps to be a dataset-dependent hyperparameter.\\n\\nas well as an internal benchmark dataset which we call Web Domain (a version of this dataset was previously used by Ghorbani et al. (2021)). These datasets contain a set of sentences in the source language and the corresponding human β€œreference” translation. We selected a different language pair for each dataset to test the generality of the results. We kept a separate validation and test sets with unseen source sentences for the evaluation purposes.', metadata={'source': '/var/folders/xq/s0z93ftx6y917vsp8zwmcj_40000gn/T/tmp4u21fnh2/tmp.pdf'}), Document(page_content='3\\n\\nReinforced Self-Training (ReST) for Language Modeling\\n\\nGrow\\n\\nThe Grow step corresponds to the acting or data-generation step in RL. We create an augmented dataset of trajectories D𝑔 by sampling many output sequences from the current policy πœ‹πœƒ, i.e., π’š ∼ πœ‹πœƒ( π’š|𝒙) for 𝒙 ∼ D. The new dataset of sequences is then scored with a reward function 𝑅(𝒙, π’š). The datapoints with the reward above a threshold score are used to update the policy (see next). Once the policy is improved, a new dataset of better quality samples can be created once again (Figure 2, bottom).\\n\\nImprove\\n\\nAt the Improve step (exploitation or policy improvement in RL terminology), the goal is to use the new dataset D𝑔 to fine-tune the policy πœ‹πœƒ. We start by defining a filtering function that includes only samples with rewards higher than a certain threshold 𝜏:\\n\\nLet us note that the threshold based filtering function may re- sult into learning suboptimal behaviors that favors outcomes with high variance in the environments with stochastic dy- namics (Brandfonbrener et al., 2022). However, in this work we formulate the language modeling and translation tasks as deterministic RL problems (Appendix A.1.)\\n\\nNext, we finetune the current best policy typically trained with either the supervised learning loss LNLL from equation 1 or an offline RL loss L (𝒙, π’š; πœƒ) on the fil- tered data such as V-MPO (Song et al., 2020) or offline actor-critic (Mathieu et al., 2021). To sum up, we use the following reward weighted loss 𝐽:\\n\\n𝐹(𝒙, π’š; 𝜏) = 1𝑅 (𝒙,π’š ) >𝜏.\\n\\nFigure 2 | ReST algorithm. Top: At Improve steps I=1,I=2,I=3, the dataset from the initial policy is filtered with thresholds 𝜏1 < 𝜏2 < 𝜏3 and a se- quence of policies πœ‹πœƒ1 , πœ‹πœƒ3 are fine- tuned. Bottom: If we were to sample from those policies (grey), the quality of samples would increase. In practice, only the final policy πœ‹πœƒ3 is used to gen- erate the next dataset D𝑔.\\n\\n, πœ‹πœƒ2\\n\\n𝐽 (πœƒ) = 𝔼(𝒙,π’š )∼D𝑔 [𝐹(𝒙, π’š; 𝜏) L (𝒙, π’š; πœƒ)] .\\n\\n(2)\\n\\nStandard imitation learning approaches, such as BC (Pomerleau (1989), equation 1) and one- step RL methods like Behavior Value Estimation (BVE) (Gulcehre et al., 2021) perform one-step of Improve on the fixed dataset D. In contrast, the basic version of ReST additionally includes a Grow step that allows the model to gather multiple new output sequences (potential translations) for contexts 𝒙 from the original dataset (source sentences to translate).\\n\\nWhen iterating over Improve steps, we increase the filtering thresholds: 𝜏1 < Β· Β· Β· < πœπ‘ βˆ’1 < πœπ‘ (Figure 2). This filtering with the growing threshold results in data subsets of increasing quality but of decreasing size. 
As LLMs overfit to small datasets quickly, we fine-tune every new policy from the previous policy with a lower learning rate. Consecutive fine-tuning of policies {πœ‹πœƒπ‘˜ }π‘˜β‰₯1 on higher quality data subsets ensures policy improvement with a fixed dataset D𝑔. If we were to sample from policies {πœ‹πœƒπ‘˜ }π‘˜β‰₯1, the average reward of the generated samples would be increasing (shown in grey in Figure 2). As sampling from a policy in the Grow step is computationally expensive, after each such step we perform several Improve steps. Thus, the cost of a single dataset generation is amortised over multiple Improve steps. Algorithm 1 outlines the full ReST algorithm with multiple dataset growth and policy improvement steps.\\n\\n4\\n\\nReinforced Self-Training (ReST) for Language Modeling\\n\\nAlgorithm 1: ReST algorithm. ReST is a growing-batch RL algorithm. Given an initial policy of reasonable quality (for example, pre-trained using BC) iteratively applies Grow and Improve steps to update the policy. Here 𝐹 is a filtering function, and L is an loss function. Input: D: Dataset, Dπ‘’π‘£π‘Žπ‘™: Evaluation dataset, L (𝒙, π’š; πœƒ): loss, 𝑅(𝒙, π’š): reward model, 𝐺:\\n\\nnumber of grow steps, 𝐼: number of improve steps, 𝑁: number of samples per context', metadata={'source': '/var/folders/xq/s0z93ftx6y917vsp8zwmcj_40000gn/T/tmp4u21fnh2/tmp.pdf'}), Document(page_content='Figure 5 | WMT 2020 zh-en (test): BC (in grey, 𝐺 = 0 𝐼 = 0) and ReST trained with different offline RL losses. ReST is trained with one Grow and Improve step except 𝐺 = 1 𝐼 = 0, which is trained on the entire dataset generated after the first Grow step without any Improve (all in purple). All variants of ReST outperform the initial BC baseline, with BC loss resulting in the best performance.\\n\\nWhich loss is the best for a single step of ReST? Figure 5 depicts variants of ReST with different offline RL losses L (𝒙, π’š; πœƒ). We find that BC loss outperforms other loss functions. Note that normally BC algorithm does not depend on the reward, but in ReST, the reward is taken into account through the reward filtering stage for 𝐼 β‰₯ 1 (with 𝜏1 = 0.8 for WMT 2020.) Results with multiple Grow and Improve steps are displayed in Figure 4 (see also Appendix A.6).\\n\\nthat in all our datasets 𝜏1 > 0.7 β‰₯ π‘‰πœ‹πœƒ by empirically measuring π‘‰πœ‹πœƒ over the dataset.\\n\\n4If we used less than 6 steps, we skipped the initial smaller thresholds. Details are given in Appendix A.3. We ensured\\n\\n7\\n\\nReinforced Self-Training (ReST) for Language Modeling\\n\\nAlgorithm BC (G=0, I=0) ReST (G=1, I=0) ReST (G=1, I=4) ReST (G=2, I=3) Online RL\\n\\nAverage Reward Distinct samples\\n\\n70.9 71.9 77.8 83.1 71.6\\n\\n16 000 000 16 000 000 16 000 000 32 000 000 24 000 000\\n\\nTable 1 | Online RL for IWSLT 2014: Online RL performs as well as ReST (G=1, I=0) and ReST (G=1, I=4) is significantly better.\\n\\nCan ReST be improved further with Best-of-N sampling at inference time? Best-of-N sampling technique at inference time generates 𝑁 samples which are then ranked by the reward model. Then, the top ranked candidate is selected (Gao et al., 2022). We show results with Best-of-N sampling on top of BC (G=0 I=0) and ReST variants in Figure 6. The performance of ReST improves both with 𝑁 and with the number of Improve steps. The best ReST variant with 𝑁 < 10 matches the performance of the BC model with 𝑁 = 200. 
Even though RL is known to limit the diversity of samples, this experiment shows that ReST can still benefit from Best-of-N sampling. After three Improve steps with 𝑁 = 200, ReST achieves the highest possible reward of 1, outperforming the β€œreference” translations in D.\\n\\nHow does ReST compare with Online RL? We com- pared ReST with PPO (Schulman et al., 2017), an online RL algorithm widely used for RLHF (Glaese et al., 2022; Ouyang et al., 2022a). For our online RL experiments, we used the setup of Donato et al. (2022) where PPO had access to a similar amount of training data as ReST with 1 Grow step. The results are summarized in Table 1. Online RL performs as well as ReST with one Grow and no Improve steps which is equivalent to BC on the D𝑔 dataset. With the same amount of training data, ReST with multiple Improve steps achieves significantly higher rewards. Furthermore, we noticed that the BLEU score for the online RL policy on the validation set dropped by nearly 8 points (BLEU score of ReST did not change) which indicates a potential reward hacking behaviour. ReST’s ability to improve the reward model score without dete- riorating the performance on other metrics suggests that the β€œalignment tax” it pays is lower than for online RL approaches.\\n\\nFigure 6 | Best-of-N sampling at infer- ence time. All variants of ReST benefit as much from Best-of-N sampling as su- pervised models.', metadata={'source': '/var/folders/xq/s0z93ftx6y917vsp8zwmcj_40000gn/T/tmp4u21fnh2/tmp.pdf'}), Document(page_content='We used Metric X in our experiments, a state-of-art reference-free reward model (Freitag et al., 2022) which, for a given source text and a proposed translation, outputs a numerical score. We report results in terms of average rewards on samples generated by a policy on the validation set 1. For the details of the datasets and models, we refer to Appendix A.3. Also, Table 2 indicates the size of the datasets by reporting the number of samples per source sentence generated at each Grow step.\\n\\nNomenclature We named variants of ReST by the loss type, number of Grow steps, and number of Improve steps, for example GOLD G=1 I=2. With this convention, BC G=0 I=0 refers to standard supervised learning, which is trained only on the original dataset D and performs neither Grow nor Improve steps. When the loss type is not specified, the BC loss is used, i.e., the model is trained with auto-regressive supervised learning with the NLL loss as typical in training language models. In all plots, we colored supervised learning in grey and ReST variants in shades of purple.\\n\\nBaselines We reported the results with several different offline RL method, including Offline Actor Critic (OAC) (Mathieu et al., 2021), Behavior VMPO (BVMPO), Generation by Off-policy Learning from Demonstrations (GOLD) (Pang and He, 2021), and BC (Pomerleau, 1989) 2.\\n\\nDo multiple Improve steps in ReST increase the reward model scores? We evaluated ReST on three different datasets by fixing the loss function to BC and increasing the number of Improve steps. The range of rewards for training was normalized between 0 and 1 3. For our experiments, we\\n\\n1Performance on the test set follows the same trends (see Appendix A.4). We also experimented with BLEURT and BLEU scores, and ReST improved those scores as well. 
We noticed that online PPO algorithm can learn to exploit the weaknesses and biases of these two metrics quickly, which can cause the model to generate samples that maximize the rewards but deteriorate the quality of the model’s output.\\n\\n2Details on our baselines are in Appendix A.8 and the experiments with additional losses are in Appendix A.2 3Note that the plots show rewards between 0 and 100.\\n\\n6\\n\\nReinforced Self-Training (ReST) for Language Modeling\\n\\nFigure 4 | ReST with two Grow steps. The second Grow step with subsequent Improve steps improves the performance by 5.3 points on IWSLT 2014 De-En and 0.8 points on Web Domain En-Zh task over the first Grow step.\\n\\npicked the filtering thresholds πœπ‘– from a sequence of increasing values [0.0, 0.7, 0.8, 0.9, 0.95, 0.99] 4. The 𝜏0 = 0.0 case corresponds to using the full dataset. We did five Improve steps on IWSLT 2014, four on WMT-2020, and two on Web Domain. In Figure 3 we plotted the average reward of different variants of ReST. We see that each subsequent Improve step improves the performance of the translation model significantly across all three datasets.\\n\\nDo additional Grow steps improve reward model scores? We performed a second Grow step with suc- cessive Improve steps to measure the effect of the extra Grow step on the performance. In Figure 4, a method with an additional Grow step achieves further improvement on the IWSLT 2014 and Web Domain datasets. We noticed a 5.3 point improvement be- tween the end of the first and the second Grow step.\\n\\nDoes ReST improve over supervised training? To answer this question, in Figure 5 we plotted the aver- age reward achieved by the supervised learning model as well as several variants of ReST with different losses and the number of Grow and Improve steps. Differ- ent variants of ReST (purple) significantly outperform supervised learning (gray) even after just the first grow step. This observation was consistent across different datasets and language pairs that we tested.', metadata={'source': '/var/folders/xq/s0z93ftx6y917vsp8zwmcj_40000gn/T/tmp4u21fnh2/tmp.pdf'})]\n"
]
}
],
"source": [
"from chromadb import Settings\n",
"import uuid\n",
"import chromadb\n",
"from chromadb.utils.batch_utils import create_batches\n",
"from langchain.document_loaders import OnlinePDFLoader\n",
"from langchain.vectorstores import Chroma\n",
"from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings\n",
"# add any other imports\n",
"client = chromadb.PersistentClient(path=\"./cq_batching_with_lc\",settings=Settings(allow_reset=True))\n",
"client.reset()\n",
"embedding_function = SentenceTransformerEmbeddings(model_name=\"all-MiniLM-L6-v2\")\n",
"col = client.get_or_create_collection(\"my_collection\",embedding_function=embedding_function.embed_documents)\n",
"\n",
"loader = OnlinePDFLoader(\"https://arxiv.org/pdf/2308.08998.pdf\")\n",
"pages = loader.load_and_split()\n",
"for batch in create_batches(\n",
" api=client,\n",
" ids=[str(uuid.uuid4()) for _ in range(len(pages))],\n",
" metadatas=[t.metadata for t in pages],\n",
" documents=[t.page_content for t in pages],\n",
"):\n",
" col.add(*batch)\n",
"\n",
"db = Chroma(client=client, collection_name=col.name,embedding_function=embedding_function)\n",
"\n",
"docs = db.similarity_search(\"What is the Grow step\")\n",
"print(docs)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
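
For anyone who wants the batching pattern without the LangChain pieces, below is a minimal standalone sketch of the same idea. It assumes chromadb 0.4.x, where `chromadb.utils.batch_utils.create_batches` lives; the storage path, collection name, and generated documents are made up for illustration, and the collection's default embedding function is used instead of sentence-transformers.

```python
# Minimal sketch of batched inserts with chromadb's create_batches (assumes chromadb 0.4.x).
# The path, collection name, and documents below are illustrative only.
import uuid

import chromadb
from chromadb.utils.batch_utils import create_batches

client = chromadb.PersistentClient(path="./batching_demo")
collection = client.get_or_create_collection("demo_collection")

documents = [f"document number {i}" for i in range(10_000)]
metadatas = [{"index": i} for i in range(len(documents))]

# create_batches splits any payload that exceeds the client's max batch size into safe
# chunks (small payloads come back as a single batch).
for ids, embeddings, metas, docs in create_batches(
    api=client,
    ids=[str(uuid.uuid4()) for _ in documents],
    metadatas=metadatas,
    documents=documents,
):
    # embeddings is None here, so the collection's embedding function computes them on add
    collection.add(ids=ids, embeddings=embeddings, metadatas=metas, documents=docs)

print(collection.count())
```

In the notebook above the tuples are unpacked positionally with `col.add(*batch)`, which works because `create_batches` returns `(ids, embeddings, metadatas, documents)` in the same order that `Collection.add` expects its arguments.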
@CarlosS7

This script was a savior! I don't think you can actually find how to use this in Chroma's documentation.
