|
root[124] (1:1-850:1, 0-43048) |
|
├─0 heading[1] (1:1-1:15, 0-14) |
|
│ │ depth: 1 |
|
│ └─0 text "Introduction" (1:3-1:15, 2-14) |
|
├─1 paragraph[1] (3:1-13:13, 16-713) |
|
│ └─0 text "With the increased interest in deep learning in recent years, there has\nbeen an explosion of machine learning tools. Many popular frameworks\nsuch as Caffe (\"Jia et al. \"2014\"), CNTK (Seide and Agarwal 2016),\nTensorFlow (Abadi et al. 2015), and Theano (Theano Development Team\n2016), construct a static dataflow graph that represents the computation\nand which can then be applied repeatedly to batches of data. This\napproach provides visibility into the whole computation ahead of time,\nand can theoretically be leveraged to improve performance and\nscalability. However, it comes at the cost of ease of use, ease of\ndebugging, and flexibility of the types of computation that can be\nrepresented." (3:1-13:13, 16-713) |
|
├─2 paragraph[1] (15:1-20:42, 715-1091) |
|
│ └─0 text "Prior work has recognized the value of dynamic eager execution for deep\nlearning, and some recent frameworks implement this define-by-run\napproach, but do so either at the cost of performance (Chainer (Tokui et\nal. 2015)) or using a less expressive, faster language\n(Torch (Collobert, Bengio, and Mariéthoz 2002), DyNet (Neubig et al.\n2017)), which limits their applicability." (15:1-20:42, 715-1091) |
|
├─3 paragraph[1] (22:1-29:66, 1093-1644) |
|
│ └─0 text "However, with careful implementation and design choices, dynamic eager\nexecution can be achieved largely without sacrificing performance. This\npaper introduces PyTorch, a Python library that performs immediate\nexecution of dynamic tensor computations with automatic differentiation\nand GPU acceleration, and does so while maintaining performance\ncomparable to the fastest current libraries for deep learning. This\ncombination has turned out to be very popular in the research community\nwith, for instance, 296 ICLR 2019 submissions mentioning PyTorch." (22:1-29:66, 1093-1644) |
|
├─4 heading[1] (31:1-31:13, 1646-1658) |
|
│ │ depth: 1 |
|
│ └─0 text "Background" (31:3-31:13, 1648-1658) |
|
├─5 paragraph[1] (33:1-34:29, 1660-1755) |
|
│ └─0 text "Four major trends in scientific computing have become increasingly\nimportant for deep learning." (33:1-34:29, 1660-1755) |
|
├─6 paragraph[5] (36:1-45:35, 1757-2423) |
|
│ ├─0 text "First, starting in the 1960s, the development of domain specific\nlanguages such as APL (Abrams 1970), MATLAB (" (36:1-37:46, 1757-1867) |
|
│ ├─1 emphasis[1] (37:46-38:9, 1867-1898) |
|
│ │ └─0 text "MATLAB and Statistics\nToolbox" (37:47-38:8, 1868-1897) |
|
│ ├─2 text ", n.d.), R (R Core Team, n.d.) and Julia (Bezanson et al. 2017),\nturned multidimensional arrays (often referred to as tensors) into\nfirst-class objects supported by a comprehensive set of mathematical\nprimitives (or operators) to manipulate them. Separately, libraries such\nas NumPy(Oliphant 2006), Torch(Collobert, Bengio, and Mariéthoz 2002),\nEigen(Guennebaud, Jacob, et al. 2010) and Lush(Y. LeCun and Bottou 2002)\nmade " (38:9-44:6, 1898-2321) |
|
│ ├─3 strong[1] (44:6-44:33, 2321-2348) |
|
│ │ └─0 text "array-based programming" (44:8-44:31, 2323-2346) |
|
│ └─4 text " productive in general purpose languages\nsuch as Python, Lisp, C++ and Lua." (44:33-45:35, 2348-2423) |
|
├─7 paragraph[3] (47:1-56:34, 2425-3081) |
|
│ ├─0 text "Second, the development of " (47:1-47:28, 2425-2452) |
|
│ ├─1 strong[1] (47:28-47:57, 2452-2481) |
|
│ │ └─0 text "automatic differentiation" (47:30-47:55, 2454-2479) |
|
│ └─2 text " (Baydin et al.\n2017) made it possible to fully automate the daunting labor of computing\nderivatives. This made it significantly easier to experiment with\ndifferent machine learning approaches while still allowing for efficient\ngradient based optimization. The autograd (Maclaurin 2016) package\npopularized the use of this technique for NumPy arrays, and similar\napproaches are used in frameworks such as Chainer (Tokui et al. 2015),\nDyNet (Neubig et al. 2017), Lush (Y. LeCun and Bottou 2002),\nTorch (Collobert, Bengio, and Mariéthoz 2002), Jax (M. J. et. al. 2018)\nand Flux.jl (M. I. et. al. 2018)." (47:57-56:34, 2481-3081) |
|
├─8 paragraph[5] (58:1-78:9, 3083-4454) |
|
│ ├─0 text "Third, with the advent of the free software movement, the scientific\ncommunity moved away from closed proprietary software such as\nMatlab(" (58:1-60:8, 3083-3221) |
|
│ ├─1 emphasis[1] (60:8-60:39, 3221-3252) |
|
│ │ └─0 text "MATLAB and Statistics Toolbox" (60:9-60:38, 3222-3251) |
|
│ ├─2 text ", n.d.), and towards the\n" (60:39-61:1, 3252-3277) |
|
│ ├─3 strong[1] (61:1-61:33, 3277-3309) |
|
│ │ └─0 text "open-source Python ecosystem" (61:3-61:31, 3279-3307) |
|
│ └─4 text " with packages like NumPy (Oliphant\n2006), SciPy (Jones et al. 2001--), and Pandas (McKinney 2010). This\nfulfilled most of the numerical analysis needs of researchers while\nallowing them to take advantage of a vast repository of libraries to\nhandle dataset preprocessing, statistical analysis, plotting, and more.\nMoreover, the openness, interoperability, and flexibility of free\nsoftware fostered the development of vibrant communities that could\nquickly address new or changing needs by extending the existing\nfunctionality of a library or if needed by developing and releasing\nbrand new ones. While there is a rich offering of open-source software\nfor neural networks in languages other than Python, starting with\nLush (Y. LeCun and Bottou 2002) in Lisp, Torch (Collobert, Bengio, and\nMariéthoz 2002) in C++, Objective-C and Lua, EBLearn (Sermanet,\nKavukcuoglu, and LeCun 2009) in C++, Caffe (\"Jia et al. \"2014\") in\nC++, the network effects of a large ecosystem such as Python made it an\nessential skill to jumpstart one's research. Hence, since 2014, most\ndeep learning frameworks converged on a Python interface as an essential\nfeature." (61:33-78:9, 3309-4454) |
|
├─9 paragraph[3] (80:1-88:36, 4456-5037) |
|
│ ├─0 text "Finally, the availability and commoditization of general-purpose\nmassively parallel hardware such as GPUs provided the computing power\nrequired by deep learning methods. Specialized libraries such as\ncuDNN (Chetlur et al. 2014), along with a body of academic work (such as\n(Lavin 2015) and (Lavin and Gray 2016)), produced a set of\nhigh-performance reusable deep learning kernels that enabled frameworks\nsuch as Caffe (\"Jia et al. \"2014\"), Torch7 (Collobert, Kavukcuoglu,\nand Farabet 2011), or TensorFlow (Abadi et al. 2015) to take advantage\nof these " (80:1-88:10, 4456-5011) |
|
│ ├─1 strong[1] (88:10-88:35, 5011-5036) |
|
│ │ └─0 text "hardware accelerators" (88:12-88:33, 5013-5034) |
|
│ └─2 text "." (88:35-88:36, 5036-5037) |
|
├─10 paragraph[1] (90:1-92:52, 5039-5220) |
|
│ └─0 text "PyTorch builds on these trends by providing an array-based programming\nmodel accelerated by GPUs and differentiable via automatic\ndifferentiation integrated in the Python ecosystem." (90:1-92:52, 5039-5220) |
|
├─11 heading[1] (94:1-94:20, 5222-5241) |
|
│ │ depth: 1 |
|
│ └─0 text "Design principles" (94:3-94:20, 5224-5241) |
|
├─12 paragraph[1] (96:1-98:13, 5243-5396) |
|
│ └─0 text "PyTorch's success stems from weaving previous ideas into a design that\nbalances speed and ease of use. There are four main principles behind\nour choices:" (96:1-98:13, 5243-5396) |
|
├─13 paragraph[2] (100:1-105:57, 5398-5797) |
|
│ ├─0 strong[1] (100:1-100:16, 5398-5413) |
|
│ │ └─0 text "Be Pythonic" (100:3-100:14, 5400-5411) |
|
│ └─1 text " Data scientists are familiar with the Python language,\nits programming model, and its tools. PyTorch should be a first-class\nmember of that ecosystem. It follows the commonly established design\ngoals of keeping interfaces simple and consistent, ideally with one\nidiomatic way of doing things. It also integrates naturally with\nstandard plotting, debugging, and data processing tools." (100:16-105:57, 5413-5797) |
|
├─14 paragraph[2] (107:1-111:48, 5799-6114) |
|
│ ├─0 strong[1] (107:1-107:26, 5799-5824) |
|
│ │ └─0 text "Put researchers first" (107:3-107:24, 5801-5822) |
|
│ └─1 text " PyTorch strives to make writing models, data\nloaders, and optimizers as easy and productive as possible. The\ncomplexity inherent to machine learning should be handled internally by\nthe PyTorch library and hidden behind intuitive APIs free of\nside-effects and unexpected performance cliffs." (107:26-111:48, 5824-6114) |
|
├─15 paragraph[4] (113:1-121:15, 6116-6680) |
|
│ ├─0 strong[1] (113:1-113:34, 6116-6149) |
|
│ │ └─0 text "Provide pragmatic performance" (113:3-113:32, 6118-6147) |
|
│ ├─1 text " To be useful, PyTorch needs to deliver\ncompelling performance, although not at the expense of simplicity and\nease of use. Trading 10% of speed for a significantly simpler to use\nmodel is acceptable; 100% is not. Therefore, its " (113:34-116:50, 6149-6377) |
|
│ ├─2 emphasis[1] (116:50-116:66, 6377-6393) |
|
│ │ └─0 text "implementation" (116:51-116:65, 6378-6392) |
|
│ └─3 text "\naccepts added complexity in order to deliver that performance.\nAdditionally, providing tools that allow researchers to manually control\nthe execution of their code will empower them to find their own\nperformance improvements independent of those that the library provides\nautomatically." (116:66-121:15, 6393-6680) |
|
├─16 paragraph[2] (123:1-129:29, 6682-7131) |
|
│ ├─0 strong[1] (123:1-123:20, 6682-6701) |
|
│ │ └─0 text "Worse is better" (123:3-123:18, 6684-6699) |
|
│ └─1 text " (Gabriel, n.d.) Given a fixed amount of engineering\nresources, and all else being equal, the time saved by keeping the\ninternal implementation of PyTorch simple can be used to implement\nadditional features, adapt to new situations, and keep up with the fast\npace of progress in the field of AI. Therefore it is better to have a\nsimple but slightly incomplete solution than a comprehensive but complex\nand hard to maintain design." (123:20-129:29, 6701-7131) |
|
├─17 heading[1] (131:1-131:27, 7133-7159) |
|
│ │ depth: 1 |
|
│ └─0 text "Usability centric design" (131:3-131:27, 7135-7159) |
|
├─18 heading[1] (133:1-133:49, 7161-7209) |
|
│ │ depth: 2 |
|
│ └─0 text "Deep learning models are just Python programs" (133:4-133:49, 7164-7209) |
|
├─19 paragraph[1] (135:1-148:52, 7211-8140) |
|
│ └─0 text "In a surprisingly short amount of time, machine learning grew from\nrecognizing individual digits (Yann LeCun and Cortes, n.d.) into\nautonomously playing StarCraft (Vinyals et al. 2017). Consequently, the\nneural networks themselves evolved rapidly from simple sequences of feed\nforward layers into incredibly varied numerical programs often composed\nof many loops and recursive functions. To support this growing\ncomplexity, PyTorch foregoes the potential benefits of a\ngraph-metaprogramming based approach to preserve the imperative\nprogramming model of Python. This design was pioneered for model\nauthoring by Chainer(Tokui et al. 2015) and Dynet(Neubig et al. 2017).\nPyTorch extends this to all aspects of deep learning workflows. Defining\nlayers, composing models, loading data, running optimizers, and\nparallelizing the training process are all expressed using the familiar\nconcepts developed for general purpose programming." (135:1-148:52, 7211-8140) |
|
├─20 paragraph[1] (150:1-165:7, 8142-9138)
|
│ └─0 text "This solution ensures that any new potential neural network architecture\ncan be easily implemented with PyTorch. For instance, layers (which in\nmodern machine learning should really be understood as stateful\nfunctions with implicit parameters) are typically expressed as Python\nclasses whose constructors create and initialize their parameters, and\nwhose forward methods process an input activation. Similarly, models are\nusually represented as classes that compose individual layers, although\nnothing forces the user to structure their code in that way. The listing\nbelow demonstrates how an entire model can be created by composing\nfunctionality provided by PyTorch such as 2d convolution, matrix\nmultiplication, dropout, and softmax to classify gray-scale images. Note\nthat linear layers are of course part of the library, but we show an\nexample implementation to highlight how simple it is." (150:1-165:7, 8142-9138)
|
|
|
├─23 paragraph[1] (173:1-181:65, 9213-9829)
|
│ └─0 text "This \"everything is just a program\" philosophy is not limited to the\nmodels, and applies to optimizers and data loaders as well. This\nfacilitates experimentation with new training techniques. For example,\nto implement the very popular generative adversarial networks, one needs\nto specify two separate models (the generator and the discriminator),\nand two loss functions that depend on both models at the same time.\nRigid APIs would struggle with this setup, but the simple design\nemployed in PyTorch easily adapts to this setting, as shown in the\nlisting below." (173:1-181:65, 9213-9829)
|
├─24 html "<figure id=\"lst:gan\">\n<div class=\"sourceCode\" id=\"cb1\" data-fontsize=\"\\small\"><pre\nclass=\"sourceCode python\"><code class=\"sourceCode python\"><span id=\"cb1-1\"><a href=\"#cb1-1\" aria-hidden=\"true\" tabindex=\"-1\"></a>discriminator <span class=\"op\">=</span> create_discriminator()</span>\n<span id=\"cb1-2\"><a href=\"#cb1-2\" aria-hidden=\"true\" tabindex=\"-1\"></a>generator <span class=\"op\">=</span> create_generator()</span>\n<span id=\"cb1-3\"><a href=\"#cb1-3\" aria-hidden=\"true\" tabindex=\"-1\"></a>optimD <span class=\"op\">=</span> optim.Adam(discriminator.parameters())</span>\n<span id=\"cb1-4\"><a href=\"#cb1-4\" aria-hidden=\"true\" tabindex=\"-1\"></a>optimG <span class=\"op\">=</span> optim.Adam(generator.parameters())</span>\n<span id=\"cb1-5\"><a href=\"#cb1-5\" aria-hidden=\"true\" tabindex=\"-1\"></a></span>\n<span id=\"cb1-6\"><a href=\"#cb1-6\" aria-hidden=\"true\" tabindex=\"-1\"></a><span class=\"kw\">def</span> step(real_sample):</span>\n<span id=\"cb1-7\"><a href=\"#cb1-7\" aria-hidden=\"true\" tabindex=\"-1\"></a> <span class=\"co\"># (1) Update Discriminator</span></span>\n<span id=\"cb1-8\"><a href=\"#cb1-8\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_real <span class=\"op\">=</span> loss(discriminator(real_sample), real_label)</span>\n<span id=\"cb1-9\"><a href=\"#cb1-9\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_real.backward()</span>\n<span id=\"cb1-10\"><a href=\"#cb1-10\" aria-hidden=\"true\" tabindex=\"-1\"></a> fake <span class=\"op\">=</span> generator(get_noise())</span>\n<span id=\"cb1-11\"><a href=\"#cb1-11\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_fake <span class=\"op\">=</span> loss(discriminator(fake.detach(), fake_label)</span>\n<span id=\"cb1-12\"><a href=\"#cb1-12\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_fake.backward()</span>\n<span id=\"cb1-13\"><a href=\"#cb1-13\" aria-hidden=\"true\" tabindex=\"-1\"></a> optimD.step()</span>\n<span id=\"cb1-14\"><a href=\"#cb1-14\" aria-hidden=\"true\" tabindex=\"-1\"></a> <span class=\"co\"># (2) Update Generator</span></span>\n<span id=\"cb1-15\"><a href=\"#cb1-15\" aria-hidden=\"true\" tabindex=\"-1\"></a> errG <span class=\"op\">=</span> loss(discriminator(fake), real_label)</span>\n<span id=\"cb1-16\"><a href=\"#cb1-16\" aria-hidden=\"true\" tabindex=\"-1\"></a> errG.backward()</span>\n<span id=\"cb1-17\"><a href=\"#cb1-17\" aria-hidden=\"true\" tabindex=\"-1\"></a> optimG.step()</span></code></pre></div>\n<p><span id=\"lst:gan\" label=\"lst:gan\"></span></p>\n</figure>" (183:1-203:10, 9831-12190) |
|
├─25 paragraph[1] (205:1-211:43, 12192-12644) |
|
│ └─0 text "Since PyTorch programs execute eagerly, all the features of Python are\navailable throughout the whole design process. Print statements,\nstandard debuggers, and common visualization tools like matplotlib all\nwork as expected. Users do not have to wait for lengthy compilation\nbefore they can start running their programs, and more importantly\nintermediate computations can be observed to understand how a model\nworks and whether its results are correct." (205:1-211:43, 12192-12644) |
|
├─26 heading[1] (213:1-213:38, 12646-12683) |
|
│ │ depth: 2 |
|
│ └─0 text "Interoperability and extensibility" (213:4-213:38, 12649-12683) |
|
├─27 paragraph[5] (215:1-226:43, 12685-13508) |
|
│ ├─0 text "Easy and efficient interoperability is one of the top priorities for\nPyTorch because it opens the possibility to leverage the rich ecosystem\nof Python libraries as part of user programs. Hence, PyTorch allows for\nbidirectional exchange of data with external libraries. For example, it\nprovides a mechanism to convert between NumPy arrays and PyTorch tensors\nusing the " (215:1-220:11, 12685-13053) |
|
│ ├─1 inlineCode "torch.from_numpy()" (220:11-220:31, 13053-13073) |
|
│ ├─2 text " function and " (220:31-220:45, 13073-13087) |
|
│ ├─3 inlineCode ".numpy()" (220:45-220:55, 13087-13097) |
|
│ └─4 text " tensor method.\nSimilar functionality is also available to exchange data stored using\nthe DLPack (DMLC, n.d.) format. Note that this exchange happens in both\ncases without any data copying -- objects on both sides only describe\nhow to interpret a memory region which is shared among them. Hence,\nthose operations are actually extremely cheap, and take constant time no\nmatter how large the converted arrays are." (220:55-226:43, 13097-13508) |
|
├─28 paragraph[15] (228:1-242:30, 13510-14504) |
|
│ ├─0 text "Moreover, many of the critical systems are designed specifically to be\nextensible. For instance, the automatic differentiation system allows\nusers to add support for custom differentiable functions. To do that\nusers can define a new subclass of " (228:1-231:36, 13510-13755) |
|
│ ├─1 inlineCode "torch.autograd.Function" (231:36-231:61, 13755-13780) |
|
│ ├─2 text " that\nimplements " (231:61-232:12, 13780-13797) |
|
│ ├─3 inlineCode "forward()" (232:12-232:23, 13797-13808) |
|
│ ├─4 text " and " (232:23-232:28, 13808-13813) |
|
│ ├─5 inlineCode "backward()" (232:28-232:40, 13813-13825) |
|
│ ├─6 text " methods, which specify the\nfunction and its derivative (or more formally the vector-Jacobian\nproduct). Similarly new datasets can be added by subclassing\n" (232:40-235:1, 13825-13980) |
|
│ ├─7 inlineCode "torch.utils.data.Dataset" (235:1-235:27, 13980-14006) |
|
│ ├─8 text " and implementing two methods: " (235:27-235:58, 14006-14037) |
|
│ ├─9 inlineCode "__getitem__" (235:58-235:71, 14037-14050) |
|
│ ├─10 text "\n(the indexing operator) and " (235:71-236:29, 14050-14079) |
|
│ ├─11 inlineCode "__len__" (236:29-236:38, 14079-14088) |
|
│ ├─12 text " (the length operator), making\ndatasets behave like (possibly lazy) lists. How these work is completely\nup to the implementer, and many users leverage other Python packages for\ndata loading. The " (236:38-239:19, 14088-14283) |
|
│ ├─13 inlineCode "DataLoader" (239:19-239:31, 14283-14295) |
|
│ └─14 text " class consumes objects conforming to this\ninterface and provides an iterator over the data which takes care of\nshuffling, batching, parallelization, and management of pinned CUDA\nmemory to improve throughput." (239:31-242:30, 14295-14504) |
|
├─29 paragraph[1] (244:1-247:64, 14506-14773) |
|
│ └─0 text "Most importantly, users are free to replace any component of PyTorch\nthat does not meet the needs or performance requirements of their\nproject. They are all designed to be completely interchangeable, and\nPyTorch takes great care not to impose any particular solution." (244:1-247:64, 14506-14773) |
|
├─30 heading[1] (249:1-249:29, 14775-14803) |
|
│ │ depth: 2 |
|
│ └─0 text "Automatic differentiation" (249:4-249:29, 14778-14803) |
|
├─31 paragraph[1] (251:1-265:62, 14805-15837) |
|
│ └─0 text "Since gradient based optimization is vital to deep learning, PyTorch\nmust be able to automatically compute gradients of models specified by\nour users, and those can be arbitrary Python programs. However, Python\nis a dynamic programming language that allows changing most behaviors at\nruntime, making ahead of time source-to-source differentiation\ncumbersome. Instead, PyTorch uses the operator overloading approach,\nwhich builds up a representation of the computed function every time it\nis executed. In its current implementation (Paszke et al. 2017), PyTorch\nperforms reverse-mode automatic differentiation, which computes the\ngradient of a scalar output with respect to a multivariate input.\nDifferentiating functions with more outputs than inputs is more\nefficiently executed using forward-mode automatic differentiation, but\nthis use case is less common for machine learning applications. PyTorch\ncan be easily extended to perform forward-mode differentiation using\narray-level dual numbers (Piponi 2004; Leuck and Nagel 1999)." (251:1-265:62, 14805-15837) |
|
├─32 paragraph[1] (267:1-279:62, 15839-16750) |
|
│ └─0 text "Another interesting and uncommon feature of our system is that it can\ndifferentiate through code employing mutation on tensors, which is one\nof the basic building blocks of imperative programs. To ensure safety,\nwe have implemented a versioning system for tensors, which lets us track\ntheir modifications and ensure that we always use the data we expect.\nOne interesting tradeoff is that while we could utilize techniques like\ncopy-on-write to support arbitrary programs, we chose to not go down\nthis path, as performance-wise it is usually beneficial for the users to\nrewrite their code to ensure that no copies have to be performed. Hence,\nwhile most mutations are benign and can be handled automatically, the\nreally complicated cases result in a user error, which lets them know\nthat they likely want to restructure the program. This allows us to\navoid introducing subtle and hard-to-find performance cliffs." (267:1-279:62, 15839-16750) |
|
├─33 heading[1] (281:1-281:37, 16752-16788) |
|
│ │ depth: 1 |
|
│ └─0 text "Performance focused implementation" (281:3-281:37, 16754-16788) |
|
├─34 paragraph[1] (283:1-289:22, 16790-17227) |
|
│ └─0 text "Running deep learning algorithms efficiently from a Python interpreter\nis notoriously challenging: for instance, the global interpreter\nlock (The Python team, n.d.) effectively ensures that only one of any\nnumber of concurrent threads is running at any given time. Deep learning\nframeworks based on the construction of a static data-flow graph\nsidestep this problem by deferring the evaluation of the computation to\na custom interpreter." (283:1-289:22, 16790-17227) |
|
├─35 paragraph[1] (291:1-293:52, 17229-17419) |
|
│ └─0 text "PyTorch solved the problem differently, by carefully optimizing every\naspect of its execution while simultaneously empowering its users to\neasily leverage additional optimization strategies." (291:1-293:52, 17229-17419) |
|
├─36 heading[1] (295:1-295:25, 17421-17445) |
|
│ │ depth: 2 |
|
│ └─0 text "An efficient C++ core" (295:4-295:25, 17424-17445) |
|
├─37 paragraph[3] (297:1-310:18, 17447-18374) |
|
│ ├─0 text "Despite being closely integrated in the Python ecosystem, most of\nPyTorch is written in C++ to achieve high performance. This core\n" (297:1-299:1, 17447-17578) |
|
│ ├─1 inlineCode "libtorch" (299:1-299:11, 17578-17588) |
|
│ └─2 text " library implements the tensor data structure, the GPU and CPU\noperators, and basic parallel primitives. It also provides the automatic\ndifferentiation system, including the gradient formulas for most\nbuilt-in functions. This ensures that the computation of the derivatives\nof functions composed of core PyTorch operators is executed entirely in\na multithreaded evaluator which does not require holding the Python\nglobal interpreter lock (The Python team, n.d.). Python bindings are\ngenerated using YAML meta-data files. An interesting side-effect of this\napproach is that it allowed our community to quickly create bindings to\nmultiple other languages resulting in projects like NimTorch (Petrantoni\nand Wollenschläger, n.d.), hasktorch (Huang, Hashimoto, and Stites,\nn.d.) and others." (299:11-310:18, 17588-18374) |
|
├─38 paragraph[1] (312:1-317:46, 18376-18756) |
|
│ └─0 text "This design also allowed us to create first-class C++ bindings and\nmodeling libraries that can be used in places where Python is\ninconvenient, such as the game engine for Starcraft (Synnaeve et al.\n2018) or on mobile platforms. It is even possible to take the Python\ncode describing a PyTorch model and run it without Python using the\nTorchScript engine (The PyTorch team, n.d.b)." (312:1-317:46, 18376-18756) |
|
├─39 heading[1] (319:1-319:34, 18758-18791) |
|
│ │ depth: 2 |
|
│ └─0 text "Separate control and data flow" (319:4-319:34, 18761-18791) |
|
├─40 paragraph[1] (321:1-326:29, 18793-19170) |
|
│ └─0 text "PyTorch maintains a strict separation between its control (i.e. program\nbranches, loops) and data flow (i.e. tensors and the operations\nperformed on them). The resolution of the control flow is handled by\nPython and optimized C++ code executed on the host CPU, and result in a\nlinear sequence of operator invocations on the device. Operators can be\nrun either on CPU or on GPU." (321:1-326:29, 18793-19170) |
|
├─41 paragraph[1] (328:1-337:24, 19172-19827) |
|
│ └─0 text "PyTorch is designed to execute operators asynchronously on GPU by\nleveraging the CUDA stream mechanism (Luitjens 2014) to queue CUDA\nkernel invocations to the GPUs hardware FIFO. This allows the system to\noverlap the execution of Python code on CPU with tensor operators on\nGPU. Because the tensor operations usually take a significant amount of\ntime, this lets us saturate the GPU and reach peak performance even in\nan interpreted language with fairly high overhead like Python. Note that\nthis mechanism is nearly invisible to the user. Unless they implement\ntheir own multi-stream primitives all of the CPU-GPU synchronization is\nhandled by the library." (328:1-337:24, 19172-19827) |
|
├─42 paragraph[1] (339:1-342:25, 19829-20054) |
|
│ └─0 text "PyTorch could leverage a similar mechanism to also execute operators\nasynchronously on the CPU. However the costs of cross-thread\ncommunication and synchronization would negate the performance benefit\nof such an optimization." (339:1-342:25, 19829-20054) |
|
├─43 heading[1] (344:1-344:35, 20056-20090) |
|
│ │ depth: 2 |
|
│ └─0 text "Custom caching tensor allocator" (344:4-344:35, 20059-20090) |
|
├─44 paragraph[3] (346:1-357:50, 20092-20903) |
|
│ ├─0 text "Almost every operator must dynamically allocate an output tensor to hold\nthe result of its execution. It is therefore critical to optimize the\nspeed of the dynamic memory allocators. PyTorch can rely on optimized\nlibraries (Berger et al. 2000; Evans May 2006; Ghemawat and Menage,\nn.d.) to handle this task on CPU. However, on GPU the " (346:1-350:55, 20092-20427) |
|
│ ├─1 inlineCode "cudaFree" (350:55-350:65, 20427-20437) |
|
│ └─2 text " routine\nmay block its caller until all previously queued work on all GPUs\ncompletes. To avoid this bottleneck, PyTorch implements a custom\nallocator which incrementally builds up a cache of CUDA memory and\nreassigns it to later allocations without further use of CUDA APIs. The\nincremental allocation is also crucial for better interoperability,\nbecause taking up all GPU memory ahead of time would prevent the user\nfrom utilizing other GPU-enabled Python packages." (350:65-357:50, 20437-20903) |
|
├─45 paragraph[1] (359:1-363:14, 20905-21204) |
|
│ └─0 text "To further improve its effectiveness, this allocator was tuned for the\nspecific memory usage patterns of deep learning. For example, it rounds\nup allocations to multiples of 512 bytes to avoid fragmentation issues.\nMoreover, it maintains a distinct pool of memory for every CUDA stream\n(work queue)." (359:1-363:14, 20905-21204) |
|
├─46 paragraph[3] (365:1-373:60, 21206-21825) |
|
│ ├─0 text "The one-pool-per-stream design assumption simplifies the implementation\nand improves the performance of the allocator: because the CPU runs\nahead of the GPU, memory is freed on the CPU " (365:1-367:46, 21206-21391) |
|
│ ├─1 emphasis[1] (367:46-367:54, 21391-21399) |
|
│ │ └─0 text "before" (367:47-367:53, 21392-21398) |
|
│ └─2 text " its last use on\nthe GPU finishes. Since streams serialize execution, if the free\nprecedes the reallocation on the CPU, the same order will occur on the\nGPU. So the allocator can reallocate memory freed on the CPU immediately\nas long as the new allocation is used on the same stream as the freed\nregion. However, if an allocation was last used on one stream and then\nallocated on another, additional synchronization is needed." (367:54-373:60, 21399-21825) |
|
├─47 paragraph[1] (375:1-383:33, 21827-22418) |
|
│ └─0 text "The one-pool-per-stream design seems limiting since the allocations end\nup fragmented per stream, but in practice PyTorch almost never uses\nmultiple streams. It is notoriously hard to write CUDA kernels in a way\nthat would let them cooperatively share the GPU because exact scheduling\nis hardware controlled. In practice, kernel writers usually resort to\nmonolithic kernels that combine multiple tasks. Data loading and\ndistributed computing utilities are exceptions to the one stream design,\nand they carefully insert additional synchronization to avoid bad\ninteractions with the allocator." (375:1-383:33, 21827-22418) |
|
├─48 paragraph[1] (385:1-387:32, 22420-22590) |
|
│ └─0 text "While this design is susceptible to certain corner cases, it almost\nnever exhibits unwanted behaviors in practical code. Most of our users\nare not aware of its existence." (385:1-387:32, 22420-22590) |
|
├─49 heading[1] (389:1-389:19, 22592-22610) |
|
│ │ depth: 2 |
|
│ └─0 text "Multiprocessing" (389:4-389:19, 22595-22610) |
|
├─50 paragraph[3] (391:1-396:26, 22612-22985) |
|
│ ├─0 text "Due to the global interpreter lock (GIL) Python's default implementation\ndoes not allow concurrent threads to execute in parallel. To alleviate\nthis problem, the Python community has established a standard\n" (391:1-394:1, 22612-22818) |
|
│ ├─1 inlineCode "multiprocessing" (394:1-394:18, 22818-22835) |
|
│ └─2 text " module, containing a number of utilities that allow\nusers to easily spawn child processes and implement basic inter-process\ncommunication primitives." (394:18-396:26, 22835-22985) |
|
├─51 paragraph[5] (398:1-404:43, 22987-23435) |
|
│ ├─0 text "However, the implementation of the primitives uses the same form of\nserialization used for on-disk persistence, which is inefficient when\ndealing with large arrays. Hence, PyTorch extends the Python\n" (398:1-401:1, 22987-23186) |
|
│ ├─1 inlineCode "multiprocessing" (401:1-401:18, 23186-23203) |
|
│ ├─2 text " module into " (401:18-401:31, 23203-23216) |
|
│ ├─3 inlineCode "torch.multiprocessing" (401:31-401:54, 23216-23239) |
|
│ └─4 text ", which is a\ndrop-in replacement for the built in package and automatically moves the\ndata of tensors sent to other processes to shared memory instead of\nsending it over the communication channel." (401:54-404:43, 23239-23435) |
|
├─52 paragraph[1] (406:1-410:45, 23437-23759) |
|
│ └─0 text "This design greatly improves performance and makes the process isolation\nweaker, resulting in a programming model which more closely resembles\nregular threaded programs. Users can easily implement heavily parallel\nprograms that operate on independent GPUs but later synchronize\ngradients using all-reduce style primitives." (406:1-410:45, 23437-23759) |
|
├─53 paragraph[1] (412:1-414:29, 23761-23929) |
|
│ └─0 text "Another unique feature of this system is that it transparently handles\nsharing of CUDA tensors, making it easy to implement techniques like\nHogwild (Recht et al. 2011)." (412:1-414:29, 23761-23929) |
|
├─54 heading[1] (416:1-416:22, 23931-23952) |
|
│ │ depth: 2 |
|
│ └─0 text "Reference counting" (416:4-416:22, 23934-23952) |
|
├─55 paragraph[1] (418:1-421:69, 23954-24236) |
|
│ └─0 text "Users often design their models to utilize all memory available during\ntraining, and increasing batch sizes is a common technique of speeding\nup the process. Therefore, to deliver great performance, PyTorch has to\ntreat memory as a scarce resource that it needs to manage carefully." (418:1-421:69, 23954-24236) |
|
├─56 paragraph[1] (423:1-433:50, 24238-24993) |
|
│ └─0 text "Libraries with eager semantics have to manage tensor memory without\nknowing how it will be used in the future. Garbage collection is the\ntypical way to handle this automatically because it has good amortized\nperformance. In this approach, the runtime periodically investigates the\nstate of the system, enumerates used objects and frees everything else.\nHowever, by deferring the deallocation, it causes the program to use\nmore memory overall (Hertz and Berger 2005). Given the scarcity of GPU\nmemory, these overheads are unacceptable. In fact, Torch7 utilized the\ngarbage collector built into Lua, and a common anti-pattern among the\nusers was to sprinkle the program with explicit triggers to the garbage\ncollector, hoping that the memory errors go away." (423:1-433:50, 24238-24993) |
|
├─57 paragraph[5] (435:1-441:50, 24995-25464) |
|
│ ├─0 text "PyTorch takes a different approach: it relies on a reference counting\nscheme to track the number of uses of each tensor, and frees the\nunderlying memory " (435:1-437:19, 24995-25148) |
|
│ ├─1 emphasis[1] (437:19-437:32, 25148-25161) |
|
│ │ └─0 text "immediately" (437:20-437:31, 25149-25160) |
|
│ ├─2 text " once this count reaches zero. Note that\nPyTorch tracks both references internal to the " (437:32-438:48, 25161-25249) |
|
│ ├─3 inlineCode "libtorch" (438:48-438:58, 25249-25259) |
|
│ └─4 text " library and\nexternal references made by users in their Python code by integrating\nwith Python's own reference counting mechanism. This ensures that memory\nis released exactly when tensors become unneeded." (438:58-441:50, 25259-25464) |
|
├─58 paragraph[1] (443:1-449:69, 25466-25949) |
|
│ └─0 text "One notable caveat is that we can only guarantee the desired performance\ncharacteristics in implementations of languages that either already\nutilize reference counting (CPython, Swift, but not PyPy or many\nscripting languages such as Lua), and those that allow for user-defined\nbehavior for assignment, copies, and moves (e.g. C++, Rust). Bindings to\nimplementations that do not satisfy those criteria will have to\nimplement their own specialized memory management on top of PyTorch." (443:1-449:69, 25466-25949) |
|
├─59 heading[1] (451:1-451:13, 25951-25963) |
|
│ │ depth: 1 |
|
│ └─0 text "Evaluation" (451:3-451:13, 25953-25963) |
|
├─60 paragraph[1] (453:1-457:25, 25965-26268) |
|
│ └─0 text "In this section we compare the performance of PyTorch with several other\ncommonly-used deep learning libraries, and find that it achieves\ncompetitive performance across a range of tasks. All experiments were\nperformed on a workstation with two Intel Xeon E5-2698 v4 CPUs and one\nNVIDIA Quadro GP100 GPU." (453:1-457:25, 25965-26268) |
|
├─61 heading[1] (459:1-459:25, 26270-26294) |
|
│ │ depth: 2 |
|
│ └─0 text "Asynchronous dataflow" (459:4-459:25, 26273-26294) |
|
├─62 paragraph[1] (461:1-464:27, 26296-26539) |
|
│ └─0 text "We start by quantifying the ability of PyTorch to asynchronously execute\ndataflow on GPU. We use the built-in profiler (The PyTorch team, n.d.a)\nto instrument various benchmarks and record a timeline of the execution\nof a single training step." (461:1-464:27, 26296-26539) |
|
├─63 paragraph[1] (466:1-476:56, 26541-27221)
|
│ └─0 text "The figure below shows a representative timeline of execution for the\nfirst few operations of a ResNet-50 model. The host CPU, which queues\nthe work, quickly outpaces the execution of the operators on the GPU.\nThis allows PyTorch to achieve almost perfect device utilization. In\nthis example, GPU execution takes around three times longer than CPU\nscheduling. The exact ratio depends on the relative performance of the\nhost CPU and the GPU, as well as the number of elements in each tensor\nand the average arithmetic complexity of the floating point computations\nto be performed on the GPU." (466:1-476:56, 26541-27221)
|
├─64 paragraph[1] (478:1-480:4, 27223-27297)
|
│ └─0 image (479:1-479:36, 27237-27272)
|
│ │ title: null
|
│ │ url: "async_kernel_launches.pdf"
|
│ │ alt: "Timeline of a training step: the host CPU queues kernels well ahead of their asynchronous execution on the GPU"
|
├─65 heading[1] (482:1-482:21, 27299-27319) |
|
│ │ depth: 2 |
|
│ └─0 text "Memory management" (482:4-482:21, 27302-27319) |
|
├─66 paragraph[1] (484:1-494:70, 27321-28091)
|
│ └─0 text "We used the NVIDIA profiler to trace the execution of the CUDA runtime\nas well as the execution of the CUDA kernels launched during one\ntraining iteration of the ResNet-50 model. As shown in the figure below,\nthe behavior of the first iteration differs significantly from that of\nsubsequent ones. At first, calls to the CUDA memory management functions\n(`cudaMalloc` and `cudaFree`) slow down the execution quite dramatically\nby blocking the CPU thread for long periods of time, hence lowering the\nutilization of the GPU. This effect disappears in subsequent iterations\nas the PyTorch caching memory allocator starts reusing previously\nallocated regions." (484:1-494:70, 27321-28091)
|
├─67 paragraph[1] (496:1-498:4, 28093-28171)
|
│ └─0 image (497:1-497:40, 28107-28146)
|
│ │ title: null
|
│ │ url: "resnet50_annotated_traces.pdf"
|
│ │ alt: "Annotated traces of CUDA runtime calls and kernel launches during the first two training iterations of ResNet-50"
|
├─68 heading[1] (500:1-500:14, 28173-28186) |
|
│ │ depth: 2 |
|
│ └─0 text "Benchmarks" (500:4-500:14, 28176-28186) |
|
├─69 paragraph[1] (502:1-506:66, 28188-28528) |
|
│ └─0 text "Finally, we can get an overall sense of single-machine eager mode\nperformance of PyTorch by comparing it to three popular graph-based deep\nlearning frameworks (CNTK, MXNet and TensorFlow), a define-by-run\nframework (Chainer), and production oriented platform (PaddlePaddle).\nThe Appendix details all the steps needed to reproduce our setup." (502:1-506:66, 28188-28528) |
|
├─70 paragraph[1] (508:1-514:11, 28530-28892)
|
│ └─0 text "Our results are summarized in the table below. On all the benchmarks,\nthe performance of PyTorch is within 17% of that of the fastest\nframework. We attribute this result to the fact that these tools offload\nmost of the computation to the same version of the cuDNN and cuBLAS\nlibraries." (508:1-514:11, 28530-28892)
|
├─71 paragraph[1] (516:1-526:179, 28894-30712)
|
│ └─0 text "::: {#detailed_perf_results}\n| Framework | AlexNet | VGG-19 | ResNet-50 | MobileNet | GNMTv2 | NCF |\n|:-------------|:----------------------:|:--------------------:|:--------------------:|:---------------------:|:--------------------------:|:--------------------------:|\n| Chainer | $778 \\pm 15$ | N/A | $\\textbf{219} \\pm 1$ | N/A | N/A | N/A |\n| CNTK | $845 \\pm 8$ | $84 \\pm 3$ | $210 \\pm 1$ | N/A | N/A | N/A |\n| MXNet | $\\textbf{1554} \\pm 22$ | $113 \\pm 1$ | $218 \\pm 2$ | $444 \\pm 2$ | N/A | N/A |\n| PaddlePaddle | $933 \\pm 123$ | $112 \\pm 2$ | $192 \\pm 4$ | $\\textbf{557} \\pm 24$ | N/A | N/A |\n| TensorFlow | $1422 \\pm 27$ | $66 \\pm 2$ | $200 \\pm 1$ | $216 \\pm 15$ | $9631 \\pm 1.3\\%$ | $4.8e6 \\pm 2.9\\%$ |\n| PyTorch | $1547 \\pm 316$ | $\\textbf{119} \\pm 1$ | $212 \\pm 2$ | $463 \\pm 17$ | $\\textbf{15512} \\pm 4.8\\%$ | $\\textbf{5.4e6} \\pm 3.4\\%$ |" (516:1-526:179, 28894-30712)
|
├─72 paragraph[1] (528:1-533:4, 30714-31006) |
|
│ └─0 text "Training speed for 6 models using 32bit floats. Throughput is measured\nin images per second for the AlexNet, VGG-19, ResNet-50, and MobileNet\nmodels, in tokens per second for the GNMTv2 model, and in samples per\nsecond for the NCF model. The fastest speed for each model is shown in\nbold.\n:::" (528:1-533:4, 30714-31006) |
|
├─73 heading[1] (535:1-535:12, 31008-31019) |
|
│ │ depth: 2 |
|
│ └─0 text "Adoption" (535:4-535:12, 31011-31019) |
|
├─74 paragraph[1] (537:1-548:34, 31021-31811)
|
│ └─0 text "The validity of design decisions and their impact on ease-of-use is hard\nto measure. As a proxy, we tried to quantify how well the machine\nlearning community received PyTorch by counting how often various\nmachine learning tools (including Caffe, Chainer, CNTK, Keras, MXNet,\nPyTorch, TensorFlow, and Theano) are mentioned on arXiv e-Prints since\nthe initial release of PyTorch in January 2017. In the figure below we\nreport the monthly number of mentions of the word \"PyTorch\" as a\npercentage of all mentions among these deep learning frameworks. We\ncounted tools mentioned multiple times in a given paper only once, and\nmade the search case insensitive to account for various spellings." (537:1-548:34, 31021-31811)
|
├─75 paragraph[1] (550:1-552:4, 31813-31880)
|
│ └─0 image (551:1-551:29, 31827-31855)
|
│ │ title: null
|
│ │ url: "arxiv_mentions.pdf"
|
│ │ alt: "Monthly arXiv mentions of PyTorch as a percentage of mentions of all major deep learning frameworks since January 2017"
|
├─76 heading[1] (554:1-554:29, 31882-31910) |
|
│ │ depth: 1 |
|
│ └─0 text "Conclusion and future work" (554:3-554:29, 31884-31910) |
|
├─77 paragraph[1] (556:1-566:17, 31912-32615) |
|
│ └─0 text "PyTorch has become a popular tool in the deep learning research\ncommunity by combining a focus on usability with careful performance\nconsiderations. In addition to continuing to support the latest trends\nand advances in deep learning, in the future we plan to continue to\nimprove the speed and scalability of PyTorch. Most notably, we are\nworking on the PyTorch JIT: a suite of tools that allow PyTorch programs\nto be executed outside of the Python interpreter where they can be\nfurther optimized. We also intend to improve support for distributed\ncomputation by providing efficient primitives for data parallelism as\nwell as a Pythonic library for model parallelism based around remote\nprocedure calls." (556:1-566:17, 31912-32615) |
|
├─78 heading[1] (568:1-568:19, 32617-32635) |
|
│ │ depth: 1 |
|
│ └─0 text "Acknowledgements" (568:3-568:19, 32619-32635) |
|
├─79 paragraph[1] (570:1-587:57, 32637-33864) |
|
│ └─0 text "We are grateful to the PyTorch community for their feedback and\ncontributions that greatly influenced the design and implementation of\nPyTorch. We thank all the PyTorch core team members, contributors and\npackage maintainers including Ailing Zhang, Alex Suhan, Alfredo Mendoza,\nAlican Bozkurt, Andrew Tulloch, Ansha Yu, Anthony Shoumikhin, Bram\nWasti, Brian Vaughan, Christian Puhrsch, David Reiss, David Riazati,\nDavide Libenzi, Dmytro Dzhulgakov, Dwaraj Rajagopal, Edward Yang, Elias\nEllison, Fritz Obermeyer, George Zhang, Hao Lu, Hong Xu, Hung Duong,\nIgor Fedan, Ilia Cherniavskii, Iurii Zdebskyi, Ivan Kobzarev, James\nReed, Jeff Smith, Jerry Chen, Jerry Zhang, Jiakai Liu, Johannes M.\nDieterich, Karl Ostmo, Lin Qiao, Martin Yuan, Michael Suo, Mike Ruberry,\nMikhail Zolothukhin, Mingzhe Li, Neeraj Pradhan, Nick Korovaiko, Owen\nAnderson, Pavel Belevich, Peter Johnson, Pritam Damania, Raghuraman\nKrishnamoorthi, Richard Zou, Roy Li, Rui Zhu, Sebastian Messmer, Shen\nLi, Simon Wang, Supriya Rao, Tao Xu, Thomas Viehmann, Vincent\nQuenneville-Belair, Vishwak Srinivasan, Vitaly Fedyunin, Wanchao Liang,\nWei Yang, Will Feng, Xiaomeng Yang, Xiaoqiang Zheng, Xintao Chen,\nYangqing Jia, Yanli Zhao, Yinghai Lu and Zafar Takhirov." (570:1-587:57, 32637-33864) |
|
├─80 paragraph[3] (589:1-595:4, 33866-34165) |
|
│ ├─0 text "::: {#refs .references .csl-bib-body .hanging-indent}\n::: {#ref-TF .csl-entry}\nAbadi, Martı́n, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,\nCraig Citro, Greg S. Corrado, et al. 2015. \"TensorFlow: Large-Scale\nMachine Learning on Heterogeneous Systems.\"\n" (589:1-594:1, 33866-34131) |
|
│ ├─1 link[1] (594:1-594:30, 34131-34160) |
|
│ │ │ title: null |
|
│ │ │ url: "https://www.tensorflow.org/" |
|
│ │ └─0 text "https://www.tensorflow.org/" (594:2-594:29, 34132-34159) |
|
│ └─2 text ".\n:::" (594:30-595:4, 34160-34165) |
|
├─81 paragraph[1] (597:1-600:4, 34167-34271) |
|
│ └─0 text "::: {#ref-APL .csl-entry}\nAbrams, Philip S. 1970. \"An APL Machine.\" PhD thesis, Stanford\nUniversity.\n:::" (597:1-600:4, 34167-34271) |
|
├─82 paragraph[5] (602:1-605:4, 34273-34402) |
|
│ ├─0 text "::: {#ref-jax .csl-entry}\nal., Matthew Johnson et. 2018. \"Jax.\" " (602:1-603:39, 34273-34337) |
|
│ ├─1 emphasis[1] (603:39-603:58, 34337-34356) |
|
│ │ └─0 text "GitHub Repository" (603:40-603:57, 34338-34355) |
|
│ ├─2 text ".\n" (603:58-604:1, 34356-34358) |
|
│ ├─3 link[1] (604:1-604:32, 34358-34389) |
|
│ │ │ title: null |
|
│ │ │ url: "https://github.com/google/jax" |
|
│ │ └─0 text "https://github.com/google/jax" (604:2-604:31, 34359-34388) |
|
│ └─4 text "; GitHub.\n:::" (604:32-605:4, 34389-34402) |
|
├─83 paragraph[5] (607:1-610:4, 34404-34537) |
|
│ ├─0 text "::: {#ref-flux .csl-entry}\nal., Mike Innes et. 2018. \"Flux.jl.\" " (607:1-608:38, 34404-34468) |
|
│ ├─1 emphasis[1] (608:38-608:57, 34468-34487) |
|
│ │ └─0 text "GitHub Repository" (608:39-608:56, 34469-34486) |
|
│ ├─2 text ".\n" (608:57-609:1, 34487-34489) |
|
│ ├─3 link[1] (609:1-609:36, 34489-34524) |
|
│ │ │ title: null |
|
│ │ │ url: "https://github.com/FluxML/Flux.jl" |
|
│ │ └─0 text "https://github.com/FluxML/Flux.jl" (609:2-609:35, 34490-34523) |
|
│ └─4 text "; GitHub.\n:::" (609:36-610:4, 34524-34537) |
|
├─84 paragraph[5] (612:1-617:4, 34539-34837) |
|
│ ├─0 text "::: {#ref-autodiff_survey .csl-entry}\nBaydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul,\nand Jeffrey Mark Siskind. 2017. \"Automatic Differentiation in Machine\nLearning: A Survey.\" " (612:1-615:22, 34539-34738) |
|
│ ├─1 emphasis[1] (615:22-615:44, 34738-34760) |
|
│ │ └─0 text "J. Mach. Learn. Res." (615:23-615:43, 34739-34759) |
|
│ ├─2 text " 18 (1): 5595--5637.\n" (615:44-616:1, 34760-34781) |
|
│ ├─3 link[1] (616:1-616:52, 34781-34832) |
|
│ │ │ title: null |
|
│ │ │ url: "http://dl.acm.org/citation.cfm?id=3122009.3242010" |
|
│ │ └─0 text "http://dl.acm.org/citation.cfm?id=3122009.3242010" (616:2-616:51, 34782-34831) |
|
│ └─4 text ".\n:::" (616:52-617:4, 34832-34837) |
|
├─85 paragraph[5] (619:1-626:4, 34839-35237) |
|
│ ├─0 text "::: {#ref-hoard .csl-entry}\nBerger, Emery D., Kathryn S. McKinley, Robert D. Blumofe, and Paul R.\nWilson. 2000. \"Hoard: A Scalable Memory Allocator for Multithreaded\nApplications.\" In " (619:1-622:19, 34839-35023) |
|
│ ├─1 emphasis[1] (622:19-623:71, 35023-35147) |
|
│ │ └─0 text "Proceedings of the Ninth International Conference on\nArchitectural Support for Programming Languages and Operating Systems" (622:20-623:70, 35024-35146) |
|
│ ├─2 text ",\n117--28. ASPLOS IX. New York, NY, USA: ACM.\n" (623:71-625:1, 35147-35193) |
|
│ ├─3 link[1] (625:1-625:40, 35193-35232) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1145/378993.379232" |
|
│ │ └─0 text "https://doi.org/10.1145/378993.379232" (625:2-625:39, 35194-35231) |
|
│ └─4 text ".\n:::" (625:40-626:4, 35232-35237) |
|
├─86 paragraph[5] (628:1-632:4, 35239-35459) |
|
│ ├─0 text "::: {#ref-Julia .csl-entry}\nBezanson, Jeff, Alan Edelman, Stefan Karpinski, and Viral B Shah. 2017.\n\"Julia: A Fresh Approach to Numerical Computing.\" " (628:1-630:51, 35239-35389) |
|
│ ├─1 emphasis[1] (630:51-630:64, 35389-35402) |
|
│ │ └─0 text "SIAM Review" (630:52-630:63, 35390-35401) |
|
│ ├─2 text " 59 (1):\n65--98. " (630:64-631:9, 35402-35419) |
|
│ ├─3 link[1] (631:9-631:44, 35419-35454) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1137/141000671" |
|
│ │ └─0 text "https://doi.org/10.1137/141000671" (631:10-631:43, 35420-35453) |
|
│ └─4 text ".\n:::" (631:44-632:4, 35454-35459) |
|
├─87 paragraph[3] (634:1-638:4, 35461-35691) |
|
│ ├─0 text "::: {#ref-cudnn .csl-entry}\nChetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan D.\nCohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. \"cuDNN:\nEfficient Primitives for Deep Learning.\" " (634:1-637:42, 35461-35666) |
|
│ ├─1 emphasis[1] (637:42-637:48, 35666-35672) |
|
│ │ └─0 text "CoRR" (637:43-637:47, 35667-35671) |
|
│ └─2 text " abs/1410.0759.\n:::" (637:48-638:4, 35672-35691) |
|
├─88 paragraph[1] (640:1-643:4, 35693-35844) |
|
│ └─0 text "::: {#ref-Torch .csl-entry}\nCollobert, Ronan, Samy Bengio, and Johnny Mariéthoz. 2002. \"Torch: A\nModular Machine Learning Software Library.\" Idiap.\n:::" (640:1-643:4, 35693-35844) |
|
├─89 paragraph[3] (645:1-648:4, 35846-36016) |
|
│ ├─0 text "::: {#ref-Torch7 .csl-entry}\nCollobert, Ronan, Koray Kavukcuoglu, and Clément Farabet. 2011. \"Torch7:\nA Matlab-Like Environment for Machine Learning.\" In " (645:1-647:53, 35846-36000) |
|
│ ├─1 emphasis[1] (647:53-647:64, 36000-36011) |
|
│ │ └─0 text "NIPS 2011" (647:54-647:63, 36001-36010) |
|
│ └─2 text ".\n:::" (647:64-648:4, 36011-36016) |
|
├─90 paragraph[1] (650:1-652:4, 36018-36104) |
|
│ └─0 text "::: {#ref-dlpack .csl-entry}\nDMLC. n.d. \"DLPack: Open in Memory Tensor Structure.\"\n:::" (650:1-652:4, 36018-36104) |
|
├─91 paragraph[5] (654:1-659:4, 36106-36357) |
|
│ ├─0 text "::: {#ref-jemalloc .csl-entry}\nEvans, J. May 2006. \"A Scalable Concurrent Malloc(3) Implementation for\nFreeBSD.\" In " (654:1-656:14, 36106-36222) |
|
│ ├─1 emphasis[1] (656:14-656:58, 36222-36266) |
|
│ │ └─0 text "In BSDCan --- the Technical BSD Conference" (656:15-656:57, 36223-36265) |
|
│ ├─2 text ". Ottawa,\nCanada.\n" (656:58-658:1, 36266-36284) |
|
│ ├─3 link[1] (658:1-658:69, 36284-36352) |
|
│ │ │ title: null |
|
│ │ │ url: "http://people.freebsd.org/˜jasone/jemalloc/bsdcan2006/jemalloc.pdf" |
|
│ │ └─0 text "http://people.freebsd.org/˜jasone/jemalloc/bsdcan2006/jemalloc.pdf" (658:2-658:68, 36285-36351) |
|
│ └─4 text ".\n:::" (658:69-659:4, 36352-36357) |
|
├─92 paragraph[1] (661:1-663:4, 36359-36454) |
|
│ └─0 text "::: {#ref-worse_is_better .csl-entry}\nGabriel, Richard. n.d. \"The Rise of Worse Is Better.\"\n:::" (661:1-663:4, 36359-36454) |
|
├─93 paragraph[3] (665:1-668:4, 36456-36618) |
|
│ ├─0 text "::: {#ref-tcmalloc .csl-entry}\nGhemawat, S., and P. Menage. n.d. \"Tcmalloc: Thread-Caching Malloc.\"\n" (665:1-667:1, 36456-36556) |
|
│ ├─1 link[1] (667:1-667:58, 36556-36613) |
|
│ │ │ title: null |
|
│ │ │ url: "http://goog-perftools.sourceforge.net/doc/tcmalloc.html" |
|
│ │ └─0 text "http://goog-perftools.sourceforge.net/doc/tcmalloc.html" (667:2-667:57, 36557-36612) |
|
│ └─2 text ".\n:::" (667:58-668:4, 36613-36618) |
|
├─94 paragraph[1] (670:1-673:4, 36620-36739) |
|
│ └─0 text "::: {#ref-eigenweb .csl-entry}\nGuennebaud, Gaël, Benoît Jacob, et al. 2010. \"Eigen V3.\"\nhttp://eigen.tuxfamily.org.\n:::" (670:1-673:4, 36620-36739) |
|
├─95 paragraph[5] (675:1-681:4, 36741-37129) |
|
│ ├─0 text "::: {#ref-garbage_collection .csl-entry}\nHertz, Matthew, and Emery D. Berger. 2005. \"Quantifying the Performance\nof Garbage Collection Vs. Explicit Memory Management.\" In " (675:1-677:59, 36741-36912) |
|
│ ├─1 emphasis[1] (677:59-679:51, 36912-37036) |
|
│ │ └─0 text "Proceedings\nof the 20th Annual ACM SIGPLAN Conference on Object-Oriented\nProgramming, Systems, Languages, and Applications" (677:60-679:50, 36913-37035) |
|
│ ├─2 text ", 313--26. OOPSLA '05.\nNew York, NY, USA: ACM. " (679:51-680:25, 37036-37083) |
|
│ ├─3 link[1] (680:25-680:66, 37083-37124) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1145/1094811.1094836" |
|
│ │ └─0 text "https://doi.org/10.1145/1094811.1094836" (680:26-680:65, 37084-37123) |
|
│ └─4 text ".\n:::" (680:66-681:4, 37124-37129) |
|
├─96 paragraph[1] (683:1-685:4, 37131-37232) |
|
│ └─0 text "::: {#ref-hasktorch .csl-entry}\nHuang, Austin, Junji Hashimoto, and Sam Stites. n.d. \"HaskTorch.\"\n:::" (683:1-685:4, 37131-37232) |
|
├─97 paragraph[3] (687:1-692:4, 37234-37525)
|
│ ├─0 text "::: {#ref-Caffe .csl-entry}\nJia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan\nLong, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014.\n\"Caffe: Convolutional Architecture for Fast Feature Embedding.\"\n" (687:1-691:1, 37234-37474)
|
│ ├─1 emphasis[1] (691:1-691:37, 37474-37510)
|
│ │ └─0 text "arXiv Preprint arXiv:1408.5093" (691:2-691:36, 37475-37509)
|
│ └─2 text ".\n:::" (691:37-692:4, 37510-37525)
|
├─98 paragraph[1] (694:1-697:4, 37527-37670) |
|
│ └─0 text "::: {#ref-SciPy .csl-entry}\nJones, Eric, Travis Oliphant, Pearu Peterson, et al. 2001--. \"SciPy:\nOpen Source Scientific Tools for Python.\"\n:::" (694:1-697:4, 37527-37670) |
|
├─99 paragraph[1] (699:1-702:4, 37672-37804) |
|
│ └─0 text "::: {#ref-maxdnn .csl-entry}\nLavin, Andrew. 2015. \"maxDNN: An Efficient Convolution Kernel for Deep\nLearning with Maxwell GPUs.\"\n:::" (699:1-702:4, 37672-37804) |
|
├─100 paragraph[3] (704:1-708:4, 37806-38014) |
|
│ ├─0 text "::: {#ref-fast_cnn .csl-entry}\nLavin, Andrew, and Scott Gray. 2016. \"Fast Algorithms for Convolutional\nNeural Networks.\" " (704:1-706:19, 37806-37927) |
|
│ ├─1 emphasis[1] (706:19-707:20, 37927-37999) |
|
│ │ └─0 text "2016 IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR)" (706:20-707:19, 37928-37998) |
|
│ └─2 text ", 4013--21.\n:::" (707:20-708:4, 37999-38014) |
|
├─101 paragraph[3] (710:1-714:4, 38016-38193) |
|
│ ├─0 text "::: {#ref-mnist .csl-entry}\nLeCun, Yann, and Corinna Cortes. n.d. \"MNIST Handwritten Digit\nDatabase.\" http://yann.lecun.com/exdb/mnist/.\n" (710:1-713:1, 38016-38153) |
|
│ ├─1 link[1] (713:1-713:36, 38153-38188) |
|
│ │ │ title: null |
|
│ │ │ url: "http://yann.lecun.com/exdb/mnist/" |
|
│ │ └─0 text "http://yann.lecun.com/exdb/mnist/" (713:2-713:35, 38154-38187) |
|
│ └─2 text ".\n:::" (713:36-714:4, 38188-38193) |
|
├─102 paragraph[1] (716:1-719:4, 38195-38327) |
|
│ └─0 text "::: {#ref-Lush .csl-entry}\nLeCun, Y, and L Bottou. 2002. \"Lush Reference Manual.\" code available at\nhttp://lush.sourceforge.net.\n:::" (716:1-719:4, 38195-38327) |
|
├─103 paragraph[5] (721:1-727:4, 38329-38691) |
|
│ ├─0 text "::: {#ref-Leuck-dual-numbers .csl-entry}\nLeuck, Holger, and Hans-Hellmut Nagel. 1999. \"Automatic Differentiation\nFacilitates OF-Integration into Steering-Angle-Based Road Vehicle\nTracking.\" In " (721:1-724:15, 38329-38522) |
|
│ ├─1 emphasis[1] (724:15-725:63, 38522-38632) |
|
│ │ └─0 text "1999 Conference on Computer Vision and Pattern\nRecognition (CVPR '99), 23-25 June 1999, Ft. Collins, CO, USA" (724:16-725:62, 38523-38631) |
|
│ ├─2 text ",\n2360--65. " (725:63-726:11, 38632-38644) |
|
│ ├─3 link[1] (726:11-726:53, 38644-38686) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1109/CVPR.1999.784659" |
|
│ │ └─0 text "https://doi.org/10.1109/CVPR.1999.784659" (726:12-726:52, 38645-38685) |
|
│ └─4 text ".\n:::" (726:53-727:4, 38686-38691) |
|
├─104 paragraph[3] (729:1-732:4, 38693-38883) |
|
│ ├─0 text "::: {#ref-cuda_stream .csl-entry}\nLuitjens, Justin. 2014. \"CUDA Streams.\"\n" (729:1-731:1, 38693-38767) |
|
│ ├─1 link[1] (731:1-731:112, 38767-38878) |
|
│ │ │ title: null |
|
│ │ │ url: "http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf" |
|
│ │ └─0 text "http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf" (731:2-731:111, 38768-38877) |
|
│ └─2 text ".\n:::" (731:112-732:4, 38878-38883) |
|
├─105 paragraph[1] (734:1-737:4, 38885-39066) |
|
│ └─0 text "::: {#ref-maclaurin2016phd .csl-entry}\nMaclaurin, Dougal. 2016. \"Modeling, Inference and Optimization with\nComposable Differentiable Procedures.\" PhD thesis, Harvard University.\n:::" (734:1-737:4, 38885-39066) |
|
├─106 paragraph[3] (739:1-742:4, 39068-39196) |
|
│ ├─0 text "::: {#ref-Matlab .csl-entry}\n" (739:1-740:1, 39068-39097) |
|
│ ├─1 emphasis[1] (740:1-740:32, 39097-39128) |
|
│ │ └─0 text "MATLAB and Statistics Toolbox" (740:2-740:31, 39098-39127) |
|
│ └─2 text ". n.d. Natick, Massachusetts, United\nStates: The MathWorks, Inc.\n:::" (740:32-742:4, 39128-39196) |
|
├─107 paragraph[3] (744:1-748:4, 39198-39371) |
|
│ ├─0 text "::: {#ref-Pandas .csl-entry}\nMcKinney, Wes. 2010. \"Data Structures for Statistical Computing in\nPython.\" In " (744:1-746:13, 39198-39306) |
|
│ ├─1 emphasis[1] (746:13-747:7, 39306-39366) |
|
│ │ └─0 text "Proceedings of the 9th Python in Science Conference,\n51-56" (746:14-747:6, 39307-39365) |
|
│ └─2 text ".\n:::" (747:7-748:4, 39366-39371) |
|
├─108 paragraph[5] (750:1-755:4, 39373-39617) |
|
│ ├─0 text "::: {#ref-DyNet .csl-entry}\nNeubig, G., C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A.\nAnastasopoulos, M. Ballesteros, et al. 2017. \"DyNet: The Dynamic Neural\nNetwork Toolkit.\" " (750:1-753:19, 39373-39551) |
|
│ ├─1 emphasis[1] (753:19-753:35, 39551-39567) |
|
│ │ └─0 text "ArXiv e-Prints" (753:20-753:34, 39552-39566) |
|
│ ├─2 text ", January.\n" (753:35-754:1, 39567-39578) |
|
│ ├─3 link[1] (754:1-754:35, 39578-39612) |
|
│ │ │ title: null |
|
│ │ │ url: "https://arxiv.org/abs/1701.03980" |
|
│ │ └─0 text "https://arxiv.org/abs/1701.03980" (754:2-754:34, 39579-39611) |
|
│ └─4 text ".\n:::" (754:35-755:4, 39612-39617) |
|
├─109 paragraph[1] (757:1-760:4, 39619-39726) |
|
│ └─0 text "::: {#ref-Numpy .csl-entry}\nOliphant, Travis. 2006. \"NumPy: A Guide to NumPy.\" USA: Trelgol\nPublishing.\n:::" (757:1-760:4, 39619-39726) |
|
├─110 paragraph[3] (762:1-766:4, 39728-39982) |
|
│ ├─0 text "::: {#ref-pytorch_autodiff .csl-entry}\nPaszke, Adam, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,\nZachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam\nLerer. 2017. \"Automatic Differentiation in PyTorch.\" In " (762:1-765:57, 39728-39962) |
|
│ ├─1 emphasis[1] (765:57-765:72, 39962-39977) |
|
│ │ └─0 text "NIPS Workshop" (765:58-765:71, 39963-39976) |
|
│ └─2 text ".\n:::" (765:72-766:4, 39977-39982) |
|
├─111 paragraph[1] (768:1-770:4, 39984-40082) |
|
│ └─0 text "::: {#ref-nimtorch .csl-entry}\nPetrantoni, Giovanni, and Jörg Wollenschläger. n.d. \"NimTorch.\"\n:::" (768:1-770:4, 39984-40082) |
|
├─112 paragraph[5] (772:1-776:4, 40084-40310) |
|
│ ├─0 text "::: {#ref-Piponi-dual-numbers .csl-entry}\nPiponi, Dan. 2004. \"Automatic Differentiation, C++ Templates, and\nPhotogrammetry.\" " (772:1-774:18, 40084-40209) |
|
│ ├─1 emphasis[1] (774:18-774:50, 40209-40241) |
|
│ │ └─0 text "J. Graphics, GPU, & Game Tools" (774:19-774:49, 40210-40240) |
|
│ ├─2 text " 9 (4): 41--55.\n" (774:50-775:1, 40241-40257) |
|
│ ├─3 link[1] (775:1-775:49, 40257-40305) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1080/10867651.2004.10504901" |
|
│ │ └─0 text "https://doi.org/10.1080/10867651.2004.10504901" (775:2-775:48, 40258-40304) |
|
│ └─4 text ".\n:::" (775:49-776:4, 40305-40310) |
|
├─113 paragraph[5] (778:1-782:4, 40312-40502) |
|
│ ├─0 text "::: {#ref-R .csl-entry}\nR Core Team. n.d. " (778:1-779:19, 40312-40354) |
|
│ ├─1 emphasis[1] (779:19-780:11, 40354-40411) |
|
│ │ └─0 text "R: A Language and Environment for Statistical\nComputing" (779:20-780:10, 40355-40410) |
|
│ ├─2 text ". Vienna, Austria: R Foundation for Statistical Computing.\n" (780:11-781:1, 40411-40470) |
|
│ ├─3 link[1] (781:1-781:28, 40470-40497) |
|
│ │ │ title: null |
|
│ │ │ url: "http://www.R-project.org/" |
|
│ │ └─0 text "http://www.R-project.org/" (781:2-781:27, 40471-40496) |
|
│ └─4 text ".\n:::" (781:28-782:4, 40497-40502) |
|
├─114 paragraph[5] (784:1-792:4, 40504-41004) |
|
│ ├─0 text "::: {#ref-Hogwild .csl-entry}\nRecht, Benjamin, Christopher Ré, Stephen J. Wright, and Feng Niu. 2011.\n\"Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient\nDescent.\" In " (784:1-787:14, 40504-40687) |
|
│ ├─1 emphasis[1] (787:14-789:68, 40687-40879) |
|
│ │ └─0 text "Advances in Neural Information Processing Systems 24: 25th\nAnnual Conference on Neural Information Processing Systems 2011.\nProceedings of a Meeting Held 12-14 December 2011, Granada, Spain." (787:15-789:67, 40688-40878) |
|
│ ├─2 text ",\n693--701.\n" (789:68-791:1, 40879-40891) |
|
│ ├─3 link[1] (791:1-791:109, 40891-40999) |
|
│ │ │ title: null |
|
│ │ │ url: "http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent" |
|
│ │ └─0 text "http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent" (791:2-791:108, 40892-40998) |
|
│ └─4 text ".\n:::" (791:109-792:4, 40999-41004) |
|
├─115 paragraph[5] (794:1-800:4, 41006-41320) |
|
│ ├─0 text "::: {#ref-CNTK .csl-entry}\nSeide, Frank, and Amit Agarwal. 2016. \"CNTK: Microsoft's Open-Source\nDeep-Learning Toolkit.\" In " (794:1-796:28, 41006-41129) |
|
│ ├─1 emphasis[1] (796:28-797:65, 41129-41229) |
|
│ │ └─0 text "Proceedings of the 22Nd ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining" (796:29-797:64, 41130-41228) |
|
│ ├─2 text ",\n2135--35. KDD '16. New York, NY, USA: ACM.\n" (797:65-799:1, 41229-41274) |
|
│ ├─3 link[1] (799:1-799:42, 41274-41315) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1145/2939672.2945397" |
|
│ │ └─0 text "https://doi.org/10.1145/2939672.2945397" (799:2-799:41, 41275-41314) |
|
│ └─4 text ".\n:::" (799:42-800:4, 41315-41320) |
|
├─116 paragraph[3] (802:1-807:4, 41322-41566) |
|
│ ├─0 text "::: {#ref-EBLearn .csl-entry}\nSermanet, Pierre, Koray Kavukcuoglu, and Yann LeCun. 2009. \"Eblearn:\nOpen-Source Energy-Based Learning in c++.\" In " (802:1-804:47, 41322-41467) |
|
│ ├─1 emphasis[1] (804:47-805:64, 41467-41546) |
|
│ │ └─0 text "2009 21st IEEE\nInternational Conference on Tools with Artificial Intelligence" (804:48-805:63, 41468-41545) |
|
│ └─2 text ",\n693--97. IEEE.\n:::" (805:64-807:4, 41546-41566) |
|
├─117 paragraph[3] (809:1-814:4, 41568-41859) |
|
│ ├─0 text "::: {#ref-starcraft_pytorch .csl-entry}\nSynnaeve, G., Z. Lin, J. Gehring, D. Gant, V. Mella, V. Khalidov, N.\nCarion, and N. Usunier. 2018. \"Forward Modeling for Partial Observation\nStrategy Games - a Starcraft Defogger.\" In " (809:1-812:44, 41568-41792) |
|
│ ├─1 emphasis[1] (812:44-813:32, 41792-41843) |
|
│ │ └─0 text "Advances in Neural\nInformation Processing Systems" (812:45-813:31, 41793-41842) |
|
│ └─2 text ", 10761--71.\n:::" (813:32-814:4, 41843-41859) |
|
├─118 paragraph[1] (816:1-818:4, 41861-41959) |
|
│ └─0 text "::: {#ref-python_gil .csl-entry}\nteam, The Python. n.d. \"The CPython Global Interpreter Lock.\"\n:::" (816:1-818:4, 41861-41959) |
|
├─119 paragraph[3] (820:1-822:4, 41961-42059) |
|
│ ├─0 text "::: {#ref-autograd_profiler .csl-entry}\nteam, The PyTorch. n.d.a. " (820:1-821:27, 41961-42027) |
|
│ ├─1 emphasis[1] (821:27-821:54, 42027-42054) |
|
│ │ └─0 text "Pytorch Autograd Profiler" (821:28-821:53, 42028-42053) |
|
│ └─2 text ".\n:::" (821:54-822:4, 42054-42059) |
|
├─120 paragraph[3] (824:1-826:4, 42061-42132) |
|
│ ├─0 text "::: {#ref-torchscript .csl-entry}\n---------. n.d.b. " (824:1-825:19, 42061-42113) |
|
│ ├─1 emphasis[1] (825:19-825:33, 42113-42127) |
|
│ │ └─0 text "Torch Script" (825:20-825:32, 42114-42126) |
|
│ └─2 text ".\n:::" (825:33-826:4, 42127-42132) |
|
├─121 paragraph[5] (828:1-832:4, 42134-42361) |
|
│ ├─0 text "::: {#ref-Theano .csl-entry}\nTheano Development Team. 2016. \"[Theano: A Python framework for fast\ncomputation of mathematical expressions]{.nocase}.\" " (828:1-830:53, 42134-42284) |
|
│ ├─1 emphasis[1] (830:53-830:69, 42284-42300) |
|
│ │ └─0 text "arXiv e-Prints" (830:54-830:68, 42285-42299) |
|
│ ├─2 text "\nabs/1605.02688 (May). " (830:69-831:23, 42300-42323) |
|
│ ├─3 link[1] (831:23-831:56, 42323-42356) |
|
│ │ │ title: null |
|
│ │ │ url: "http://arxiv.org/abs/1605.02688" |
|
│ │ └─0 text "http://arxiv.org/abs/1605.02688" (831:24-831:55, 42324-42355) |
|
│ └─4 text ".\n:::" (831:56-832:4, 42356-42361) |
|
├─122 paragraph[5] (834:1-841:4, 42363-42752) |
|
│ ├─0 text "::: {#ref-Chainer .csl-entry}\nTokui, Seiya, Kenta Oono, Shohei Hido, and Justin Clayton. 2015.\n\"Chainer: A Next-Generation Open Source Framework for Deep Learning.\" In\n" (834:1-837:1, 42363-42531) |
|
│ ├─1 emphasis[1] (837:1-839:16, 42531-42684) |
|
│ │ └─0 text "Proceedings of Workshop on Machine Learning Systems (LearningSys) in\nthe Twenty-Ninth Annual Conference on Neural Information Processing\nSystems (NIPS)" (837:2-839:15, 42532-42683) |
|
│ ├─2 text ".\n" (839:16-840:1, 42684-42686) |
|
│ ├─3 link[1] (840:1-840:62, 42686-42747) |
|
│ │ │ title: null |
|
│ │ │ url: "http://learningsys.org/papers/LearningSys_2015_paper_33.pdf" |
|
│ │ └─0 text "http://learningsys.org/papers/LearningSys_2015_paper_33.pdf" (840:2-840:61, 42687-42746) |
|
│ └─4 text ".\n:::" (840:62-841:4, 42747-42752) |
|
└─123 paragraph[5] (843:1-849:4, 42754-43047) |
|
├─0 text "::: {#ref-starcraft2 .csl-entry}\nVinyals, Oriol, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander\nSasha Vezhnevets, Michelle Yeo, Alireza Makhzani, et al. 2017.\n\"StarCraft II: A New Challenge for Reinforcement Learning.\" " (843:1-846:61, 42754-42982) |
|
├─1 emphasis[1] (846:61-846:67, 42982-42988) |
|
│ └─0 text "CoRR" (846:62-846:66, 42983-42987) |
|
├─2 text "\nabs/1708.04782. " (846:67-847:17, 42988-43005) |
|
├─3 link[1] (847:17-847:50, 43005-43038) |
|
│ │ title: null |
|
│ │ url: "http://arxiv.org/abs/1708.04782" |
|
│ └─0 text "http://arxiv.org/abs/1708.04782" (847:18-847:49, 43006-43037) |
|
└─4 text ".\n:::\n:::" (847:50-849:4, 43038-43047) |