|
root[124] (1:1-850:1, 0-43048) |
|
├─0 heading[1] (1:1-1:15, 0-14) |
|
│ │ depth: 1 |
|
│ └─0 text "Introduction" (1:3-1:15, 2-14) |
|
├─1 paragraph[1] (3:1-13:13, 16-713) |
|
│ └─0 text "With the increased interest in deep learning in recent years, there has\nbeen an explosion of machine learning tools. Many popular frameworks\nsuch as Caffe (\"Jia et al. \"2014\"), CNTK (Seide and Agarwal 2016),\nTensorFlow (Abadi et al. 2015), and Theano (Theano Development Team\n2016), construct a static dataflow graph that represents the computation\nand which can then be applied repeatedly to batches of data. This\napproach provides visibility into the whole computation ahead of time,\nand can theoretically be leveraged to improve performance and\nscalability. However, it comes at the cost of ease of use, ease of\ndebugging, and flexibility of the types of computation that can be\nrepresented." (3:1-13:13, 16-713) |
|
├─2 paragraph[1] (15:1-20:42, 715-1091) |
|
│ └─0 text "Prior work has recognized the value of dynamic eager execution for deep\nlearning, and some recent frameworks implement this define-by-run\napproach, but do so either at the cost of performance (Chainer (Tokui et\nal. 2015)) or using a less expressive, faster language\n(Torch (Collobert, Bengio, and Mariéthoz 2002), DyNet (Neubig et al.\n2017)), which limits their applicability." (15:1-20:42, 715-1091) |
|
├─3 paragraph[1] (22:1-29:66, 1093-1644) |
|
│ └─0 text "However, with careful implementation and design choices, dynamic eager\nexecution can be achieved largely without sacrificing performance. This\npaper introduces PyTorch, a Python library that performs immediate\nexecution of dynamic tensor computations with automatic differentiation\nand GPU acceleration, and does so while maintaining performance\ncomparable to the fastest current libraries for deep learning. This\ncombination has turned out to be very popular in the research community\nwith, for instance, 296 ICLR 2019 submissions mentioning PyTorch." (22:1-29:66, 1093-1644) |
|
├─4 heading[1] (31:1-31:13, 1646-1658) |
|
│ │ depth: 1 |
|
│ └─0 text "Background" (31:3-31:13, 1648-1658) |
|
├─5 paragraph[1] (33:1-34:29, 1660-1755) |
|
│ └─0 text "Four major trends in scientific computing have become increasingly\nimportant for deep learning." (33:1-34:29, 1660-1755) |
|
├─6 paragraph[5] (36:1-45:35, 1757-2423) |
|
│ ├─0 text "First, starting in the 1960s, the development of domain specific\nlanguages such as APL (Abrams 1970), MATLAB (" (36:1-37:46, 1757-1867) |
|
│ ├─1 emphasis[1] (37:46-38:9, 1867-1898) |
|
│ │ └─0 text "MATLAB and Statistics\nToolbox" (37:47-38:8, 1868-1897) |
|
│ ├─2 text ", n.d.), R (R Core Team, n.d.) and Julia (Bezanson et al. 2017),\nturned multidimensional arrays (often referred to as tensors) into\nfirst-class objects supported by a comprehensive set of mathematical\nprimitives (or operators) to manipulate them. Separately, libraries such\nas NumPy(Oliphant 2006), Torch(Collobert, Bengio, and Mariéthoz 2002),\nEigen(Guennebaud, Jacob, et al. 2010) and Lush(Y. LeCun and Bottou 2002)\nmade " (38:9-44:6, 1898-2321) |
|
│ ├─3 strong[1] (44:6-44:33, 2321-2348) |
|
│ │ └─0 text "array-based programming" (44:8-44:31, 2323-2346) |
|
│ └─4 text " productive in general purpose languages\nsuch as Python, Lisp, C++ and Lua." (44:33-45:35, 2348-2423) |
|
├─7 paragraph[3] (47:1-56:34, 2425-3081) |
|
│ ├─0 text "Second, the development of " (47:1-47:28, 2425-2452) |
|
│ ├─1 strong[1] (47:28-47:57, 2452-2481) |
|
│ │ └─0 text "automatic differentiation" (47:30-47:55, 2454-2479) |
|
│ └─2 text " (Baydin et al.\n2017) made it possible to fully automate the daunting labor of computing\nderivatives. This made it significantly easier to experiment with\ndifferent machine learning approaches while still allowing for efficient\ngradient based optimization. The autograd (Maclaurin 2016) package\npopularized the use of this technique for NumPy arrays, and similar\napproaches are used in frameworks such as Chainer (Tokui et al. 2015),\nDyNet (Neubig et al. 2017), Lush (Y. LeCun and Bottou 2002),\nTorch (Collobert, Bengio, and Mariéthoz 2002), Jax (M. J. et. al. 2018)\nand Flux.jl (M. I. et. al. 2018)." (47:57-56:34, 2481-3081) |
|
├─8 paragraph[5] (58:1-78:9, 3083-4454) |
|
│ ├─0 text "Third, with the advent of the free software movement, the scientific\ncommunity moved away from closed proprietary software such as\nMatlab(" (58:1-60:8, 3083-3221) |
|
│ ├─1 emphasis[1] (60:8-60:39, 3221-3252) |
|
│ │ └─0 text "MATLAB and Statistics Toolbox" (60:9-60:38, 3222-3251) |
|
│ ├─2 text ", n.d.), and towards the\n" (60:39-61:1, 3252-3277) |
|
│ ├─3 strong[1] (61:1-61:33, 3277-3309) |
|
│ │ └─0 text "open-source Python ecosystem" (61:3-61:31, 3279-3307) |
|
│ └─4 text " with packages like NumPy (Oliphant\n2006), SciPy (Jones et al. 2001--), and Pandas (McKinney 2010). This\nfulfilled most of the numerical analysis needs of researchers while\nallowing them to take advantage of a vast repository of libraries to\nhandle dataset preprocessing, statistical analysis, plotting, and more.\nMoreover, the openness, interoperability, and flexibility of free\nsoftware fostered the development of vibrant communities that could\nquickly address new or changing needs by extending the existing\nfunctionality of a library or if needed by developing and releasing\nbrand new ones. While there is a rich offering of open-source software\nfor neural networks in languages other than Python, starting with\nLush (Y. LeCun and Bottou 2002) in Lisp, Torch (Collobert, Bengio, and\nMariéthoz 2002) in C++, Objective-C and Lua, EBLearn (Sermanet,\nKavukcuoglu, and LeCun 2009) in C++, Caffe (\"Jia et al. \"2014\") in\nC++, the network effects of a large ecosystem such as Python made it an\nessential skill to jumpstart one's research. Hence, since 2014, most\ndeep learning frameworks converged on a Python interface as an essential\nfeature." (61:33-78:9, 3309-4454) |
|
├─9 paragraph[3] (80:1-88:36, 4456-5037) |
|
│ ├─0 text "Finally, the availability and commoditization of general-purpose\nmassively parallel hardware such as GPUs provided the computing power\nrequired by deep learning methods. Specialized libraries such as\ncuDNN (Chetlur et al. 2014), along with a body of academic work (such as\n(Lavin 2015) and (Lavin and Gray 2016)), produced a set of\nhigh-performance reusable deep learning kernels that enabled frameworks\nsuch as Caffe (\"Jia et al. \"2014\"), Torch7 (Collobert, Kavukcuoglu,\nand Farabet 2011), or TensorFlow (Abadi et al. 2015) to take advantage\nof these " (80:1-88:10, 4456-5011) |
|
│ ├─1 strong[1] (88:10-88:35, 5011-5036) |
|
│ │ └─0 text "hardware accelerators" (88:12-88:33, 5013-5034) |
|
│ └─2 text "." (88:35-88:36, 5036-5037) |
|
├─10 paragraph[1] (90:1-92:52, 5039-5220) |
|
│ └─0 text "PyTorch builds on these trends by providing an array-based programming\nmodel accelerated by GPUs and differentiable via automatic\ndifferentiation integrated in the Python ecosystem." (90:1-92:52, 5039-5220) |
|
├─11 heading[1] (94:1-94:20, 5222-5241) |
|
│ │ depth: 1 |
|
│ └─0 text "Design principles" (94:3-94:20, 5224-5241) |
|
├─12 paragraph[1] (96:1-98:13, 5243-5396) |
|
│ └─0 text "PyTorch's success stems from weaving previous ideas into a design that\nbalances speed and ease of use. There are four main principles behind\nour choices:" (96:1-98:13, 5243-5396) |
|
├─13 paragraph[2] (100:1-105:57, 5398-5797) |
|
│ ├─0 strong[1] (100:1-100:16, 5398-5413) |
|
│ │ └─0 text "Be Pythonic" (100:3-100:14, 5400-5411) |
|
│ └─1 text " Data scientists are familiar with the Python language,\nits programming model, and its tools. PyTorch should be a first-class\nmember of that ecosystem. It follows the commonly established design\ngoals of keeping interfaces simple and consistent, ideally with one\nidiomatic way of doing things. It also integrates naturally with\nstandard plotting, debugging, and data processing tools." (100:16-105:57, 5413-5797) |
|
├─14 paragraph[2] (107:1-111:48, 5799-6114) |
|
│ ├─0 strong[1] (107:1-107:26, 5799-5824) |
|
│ │ └─0 text "Put researchers first" (107:3-107:24, 5801-5822) |
|
│ └─1 text " PyTorch strives to make writing models, data\nloaders, and optimizers as easy and productive as possible. The\ncomplexity inherent to machine learning should be handled internally by\nthe PyTorch library and hidden behind intuitive APIs free of\nside-effects and unexpected performance cliffs." (107:26-111:48, 5824-6114) |
|
├─15 paragraph[4] (113:1-121:15, 6116-6680) |
|
│ ├─0 strong[1] (113:1-113:34, 6116-6149) |
|
│ │ └─0 text "Provide pragmatic performance" (113:3-113:32, 6118-6147) |
|
│ ├─1 text " To be useful, PyTorch needs to deliver\ncompelling performance, although not at the expense of simplicity and\nease of use. Trading 10% of speed for a significantly simpler to use\nmodel is acceptable; 100% is not. Therefore, its " (113:34-116:50, 6149-6377) |
|
│ ├─2 emphasis[1] (116:50-116:66, 6377-6393) |
|
│ │ └─0 text "implementation" (116:51-116:65, 6378-6392) |
|
│ └─3 text "\naccepts added complexity in order to deliver that performance.\nAdditionally, providing tools that allow researchers to manually control\nthe execution of their code will empower them to find their own\nperformance improvements independent of those that the library provides\nautomatically." (116:66-121:15, 6393-6680) |
|
├─16 paragraph[2] (123:1-129:29, 6682-7131) |
|
│ ├─0 strong[1] (123:1-123:20, 6682-6701) |
|
│ │ └─0 text "Worse is better" (123:3-123:18, 6684-6699) |
|
│ └─1 text " (Gabriel, n.d.) Given a fixed amount of engineering\nresources, and all else being equal, the time saved by keeping the\ninternal implementation of PyTorch simple can be used to implement\nadditional features, adapt to new situations, and keep up with the fast\npace of progress in the field of AI. Therefore it is better to have a\nsimple but slightly incomplete solution than a comprehensive but complex\nand hard to maintain design." (123:20-129:29, 6701-7131) |
|
├─17 heading[1] (131:1-131:27, 7133-7159) |
|
│ │ depth: 1 |
|
│ └─0 text "Usability centric design" (131:3-131:27, 7135-7159) |
|
├─18 heading[1] (133:1-133:49, 7161-7209) |
|
│ │ depth: 2 |
|
│ └─0 text "Deep learning models are just Python programs" (133:4-133:49, 7164-7209) |
|
├─19 paragraph[1] (135:1-148:52, 7211-8140) |
|
│ └─0 text "In a surprisingly short amount of time, machine learning grew from\nrecognizing individual digits (Yann LeCun and Cortes, n.d.) into\nautonomously playing StarCraft (Vinyals et al. 2017). Consequently, the\nneural networks themselves evolved rapidly from simple sequences of feed\nforward layers into incredibly varied numerical programs often composed\nof many loops and recursive functions. To support this growing\ncomplexity, PyTorch foregoes the potential benefits of a\ngraph-metaprogramming based approach to preserve the imperative\nprogramming model of Python. This design was pioneered for model\nauthoring by Chainer(Tokui et al. 2015) and Dynet(Neubig et al. 2017).\nPyTorch extends this to all aspects of deep learning workflows. Defining\nlayers, composing models, loading data, running optimizers, and\nparallelizing the training process are all expressed using the familiar\nconcepts developed for general purpose programming." (135:1-148:52, 7211-8140) |
|
├─20 paragraph[1] (150:1-165:7, 8142-9138)
|
│ └─0 text "This solution ensures that any new potential neural network architecture\ncan be easily implemented with PyTorch. For instance, layers (which in\nmodern machine learning should really be understood as stateful\nfunctions with implicit parameters) are typically expressed as Python\nclasses whose constructors create and initialize their parameters, and\nwhose forward methods process an input activation. Similarly, models are\nusually represented as classes that compose individual layers, although\nnothing forces the user to structure their code in that way. The listing\nbelow demonstrates how an entire model can be created by composing\nfunctionality provided by PyTorch such as 2d convolution, matrix\nmultiplication, dropout, and softmax to classify gray-scale images. Note\nthat linear layers are of course part of the library, but we show an\nexample implementation to highlight how simple it is." (150:1-165:7, 8142-9138)
|
|
|
├─23 paragraph[1] (173:1-181:65, 9213-9829)
|
│ └─0 text "This \"everything is just a program\" philosophy is not limited to the\nmodels, and applies to optimizers and data loaders as well. This\nfacilitates experimentation with new training techniques. For example,\nto implement the very popular generative adversarial networks, one needs\nto specify two separate models (the generator and the discriminator),\nand two loss functions that depend on both models at the same time.\nRigid APIs would struggle with this setup, but the simple design\nemployed in PyTorch easily adapts to this setting, as shown in the\nlisting below." (173:1-181:65, 9213-9829)
|
├─24 html "<figure id=\"lst:gan\">\n<div class=\"sourceCode\" id=\"cb1\" data-fontsize=\"\\small\"><pre\nclass=\"sourceCode python\"><code class=\"sourceCode python\"><span id=\"cb1-1\"><a href=\"#cb1-1\" aria-hidden=\"true\" tabindex=\"-1\"></a>discriminator <span class=\"op\">=</span> create_discriminator()</span>\n<span id=\"cb1-2\"><a href=\"#cb1-2\" aria-hidden=\"true\" tabindex=\"-1\"></a>generator <span class=\"op\">=</span> create_generator()</span>\n<span id=\"cb1-3\"><a href=\"#cb1-3\" aria-hidden=\"true\" tabindex=\"-1\"></a>optimD <span class=\"op\">=</span> optim.Adam(discriminator.parameters())</span>\n<span id=\"cb1-4\"><a href=\"#cb1-4\" aria-hidden=\"true\" tabindex=\"-1\"></a>optimG <span class=\"op\">=</span> optim.Adam(generator.parameters())</span>\n<span id=\"cb1-5\"><a href=\"#cb1-5\" aria-hidden=\"true\" tabindex=\"-1\"></a></span>\n<span id=\"cb1-6\"><a href=\"#cb1-6\" aria-hidden=\"true\" tabindex=\"-1\"></a><span class=\"kw\">def</span> step(real_sample):</span>\n<span id=\"cb1-7\"><a href=\"#cb1-7\" aria-hidden=\"true\" tabindex=\"-1\"></a> <span class=\"co\"># (1) Update Discriminator</span></span>\n<span id=\"cb1-8\"><a href=\"#cb1-8\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_real <span class=\"op\">=</span> loss(discriminator(real_sample), real_label)</span>\n<span id=\"cb1-9\"><a href=\"#cb1-9\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_real.backward()</span>\n<span id=\"cb1-10\"><a href=\"#cb1-10\" aria-hidden=\"true\" tabindex=\"-1\"></a> fake <span class=\"op\">=</span> generator(get_noise())</span>\n<span id=\"cb1-11\"><a href=\"#cb1-11\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_fake <span class=\"op\">=</span> loss(discriminator(fake.detach(), fake_label)</span>\n<span id=\"cb1-12\"><a href=\"#cb1-12\" aria-hidden=\"true\" tabindex=\"-1\"></a> errD_fake.backward()</span>\n<span id=\"cb1-13\"><a href=\"#cb1-13\" aria-hidden=\"true\" tabindex=\"-1\"></a> optimD.step()</span>\n<span id=\"cb1-14\"><a href=\"#cb1-14\" aria-hidden=\"true\" tabindex=\"-1\"></a> <span class=\"co\"># (2) Update Generator</span></span>\n<span id=\"cb1-15\"><a href=\"#cb1-15\" aria-hidden=\"true\" tabindex=\"-1\"></a> errG <span class=\"op\">=</span> loss(discriminator(fake), real_label)</span>\n<span id=\"cb1-16\"><a href=\"#cb1-16\" aria-hidden=\"true\" tabindex=\"-1\"></a> errG.backward()</span>\n<span id=\"cb1-17\"><a href=\"#cb1-17\" aria-hidden=\"true\" tabindex=\"-1\"></a> optimG.step()</span></code></pre></div>\n<p><span id=\"lst:gan\" label=\"lst:gan\"></span></p>\n</figure>" (183:1-203:10, 9831-12190) |
|
├─25 paragraph[1] (205:1-211:43, 12192-12644) |
|
│ └─0 text "Since PyTorch programs execute eagerly, all the features of Python are\navailable throughout the whole design process. Print statements,\nstandard debuggers, and common visualization tools like matplotlib all\nwork as expected. Users do not have to wait for lengthy compilation\nbefore they can start running their programs, and more importantly\nintermediate computations can be observed to understand how a model\nworks and whether its results are correct." (205:1-211:43, 12192-12644) |
|
├─26 heading[1] (213:1-213:38, 12646-12683) |
|
│ │ depth: 2 |
|
│ └─0 text "Interoperability and extensibility" (213:4-213:38, 12649-12683) |
|
├─27 paragraph[5] (215:1-226:43, 12685-13508) |
|
│ ├─0 text "Easy and efficient interoperability is one of the top priorities for\nPyTorch because it opens the possibility to leverage the rich ecosystem\nof Python libraries as part of user programs. Hence, PyTorch allows for\nbidirectional exchange of data with external libraries. For example, it\nprovides a mechanism to convert between NumPy arrays and PyTorch tensors\nusing the " (215:1-220:11, 12685-13053) |
|
│ ├─1 inlineCode "torch.from_numpy()" (220:11-220:31, 13053-13073) |
|
│ ├─2 text " function and " (220:31-220:45, 13073-13087) |
|
│ ├─3 inlineCode ".numpy()" (220:45-220:55, 13087-13097) |
|
│ └─4 text " tensor method.\nSimilar functionality is also available to exchange data stored using\nthe DLPack (DMLC, n.d.) format. Note that this exchange happens in both\ncases without any data copying -- objects on both sides only describe\nhow to interpret a memory region which is shared among them. Hence,\nthose operations are actually extremely cheap, and take constant time no\nmatter how large the converted arrays are." (220:55-226:43, 13097-13508) |
|
├─28 paragraph[15] (228:1-242:30, 13510-14504) |
|
│ ├─0 text "Moreover, many of the critical systems are designed specifically to be\nextensible. For instance, the automatic differentiation system allows\nusers to add support for custom differentiable functions. To do that\nusers can define a new subclass of " (228:1-231:36, 13510-13755) |
|
│ ├─1 inlineCode "torch.autograd.Function" (231:36-231:61, 13755-13780) |
|
│ ├─2 text " that\nimplements " (231:61-232:12, 13780-13797) |
|
│ ├─3 inlineCode "forward()" (232:12-232:23, 13797-13808) |
|
│ ├─4 text " and " (232:23-232:28, 13808-13813) |
|
│ ├─5 inlineCode "backward()" (232:28-232:40, 13813-13825) |
|
│ ├─6 text " methods, which specify the\nfunction and its derivative (or more formally the vector-Jacobian\nproduct). Similarly new datasets can be added by subclassing\n" (232:40-235:1, 13825-13980) |
|
│ ├─7 inlineCode "torch.utils.data.Dataset" (235:1-235:27, 13980-14006) |
|
│ ├─8 text " and implementing two methods: " (235:27-235:58, 14006-14037) |
|
│ ├─9 inlineCode "__getitem__" (235:58-235:71, 14037-14050) |
|
│ ├─10 text "\n(the indexing operator) and " (235:71-236:29, 14050-14079) |
|
│ ├─11 inlineCode "__len__" (236:29-236:38, 14079-14088) |
|
│ ├─12 text " (the length operator), making\ndatasets behave like (possibly lazy) lists. How these work is completely\nup to the implementer, and many users leverage other Python packages for\ndata loading. The " (236:38-239:19, 14088-14283) |
|
│ ├─13 inlineCode "DataLoader" (239:19-239:31, 14283-14295) |
|
│ └─14 text " class consumes objects conforming to this\ninterface and provides an iterator over the data which takes care of\nshuffling, batching, parallelization, and management of pinned CUDA\nmemory to improve throughput." (239:31-242:30, 14295-14504) |
|
├─29 paragraph[1] (244:1-247:64, 14506-14773) |
|
│ └─0 text "Most importantly, users are free to replace any component of PyTorch\nthat does not meet the needs or performance requirements of their\nproject. They are all designed to be completely interchangeable, and\nPyTorch takes great care not to impose any particular solution." (244:1-247:64, 14506-14773) |
|
├─30 heading[1] (249:1-249:29, 14775-14803) |
|
│ │ depth: 2 |
|
│ └─0 text "Automatic differentiation" (249:4-249:29, 14778-14803) |
|
├─31 paragraph[1] (251:1-265:62, 14805-15837) |
|
│ └─0 text "Since gradient based optimization is vital to deep learning, PyTorch\nmust be able to automatically compute gradients of models specified by\nour users, and those can be arbitrary Python programs. However, Python\nis a dynamic programming language that allows changing most behaviors at\nruntime, making ahead of time source-to-source differentiation\ncumbersome. Instead, PyTorch uses the operator overloading approach,\nwhich builds up a representation of the computed function every time it\nis executed. In its current implementation (Paszke et al. 2017), PyTorch\nperforms reverse-mode automatic differentiation, which computes the\ngradient of a scalar output with respect to a multivariate input.\nDifferentiating functions with more outputs than inputs is more\nefficiently executed using forward-mode automatic differentiation, but\nthis use case is less common for machine learning applications. PyTorch\ncan be easily extended to perform forward-mode differentiation using\narray-level dual numbers (Piponi 2004; Leuck and Nagel 1999)." (251:1-265:62, 14805-15837) |
|
├─32 paragraph[1] (267:1-279:62, 15839-16750) |
|
│ └─0 text "Another interesting and uncommon feature of our system is that it can\ndifferentiate through code employing mutation on tensors, which is one\nof the basic building blocks of imperative programs. To ensure safety,\nwe have implemented a versioning system for tensors, which lets us track\ntheir modifications and ensure that we always use the data we expect.\nOne interesting tradeoff is that while we could utilize techniques like\ncopy-on-write to support arbitrary programs, we chose to not go down\nthis path, as performance-wise it is usually beneficial for the users to\nrewrite their code to ensure that no copies have to be performed. Hence,\nwhile most mutations are benign and can be handled automatically, the\nreally complicated cases result in a user error, which lets them know\nthat they likely want to restructure the program. This allows us to\navoid introducing subtle and hard-to-find performance cliffs." (267:1-279:62, 15839-16750) |
|
├─33 heading[1] (281:1-281:37, 16752-16788) |
|
│ │ depth: 1 |
|
│ └─0 text "Performance focused implementation" (281:3-281:37, 16754-16788) |
|
├─34 paragraph[1] (283:1-289:22, 16790-17227) |
|
│ └─0 text "Running deep learning algorithms efficiently from a Python interpreter\nis notoriously challenging: for instance, the global interpreter\nlock (The Python team, n.d.) effectively ensures that only one of any\nnumber of concurrent threads is running at any given time. Deep learning\nframeworks based on the construction of a static data-flow graph\nsidestep this problem by deferring the evaluation of the computation to\na custom interpreter." (283:1-289:22, 16790-17227) |
|
├─35 paragraph[1] (291:1-293:52, 17229-17419) |
|
│ └─0 text "PyTorch solved the problem differently, by carefully optimizing every\naspect of its execution while simultaneously empowering its users to\neasily leverage additional optimization strategies." (291:1-293:52, 17229-17419) |
|
├─36 heading[1] (295:1-295:25, 17421-17445) |
|
│ │ depth: 2 |
|
│ └─0 text "An efficient C++ core" (295:4-295:25, 17424-17445) |
|
├─37 paragraph[3] (297:1-310:18, 17447-18374) |
|
│ ├─0 text "Despite being closely integrated in the Python ecosystem, most of\nPyTorch is written in C++ to achieve high performance. This core\n" (297:1-299:1, 17447-17578) |
|
│ ├─1 inlineCode "libtorch" (299:1-299:11, 17578-17588) |
|
│ └─2 text " library implements the tensor data structure, the GPU and CPU\noperators, and basic parallel primitives. It also provides the automatic\ndifferentiation system, including the gradient formulas for most\nbuilt-in functions. This ensures that the computation of the derivatives\nof functions composed of core PyTorch operators is executed entirely in\na multithreaded evaluator which does not require holding the Python\nglobal interpreter lock (The Python team, n.d.). Python bindings are\ngenerated using YAML meta-data files. An interesting side-effect of this\napproach is that it allowed our community to quickly create bindings to\nmultiple other languages resulting in projects like NimTorch (Petrantoni\nand Wollenschläger, n.d.), hasktorch (Huang, Hashimoto, and Stites,\nn.d.) and others." (299:11-310:18, 17588-18374) |
|
├─38 paragraph[1] (312:1-317:46, 18376-18756) |
|
│ └─0 text "This design also allowed us to create first-class C++ bindings and\nmodeling libraries that can be used in places where Python is\ninconvenient, such as the game engine for Starcraft (Synnaeve et al.\n2018) or on mobile platforms. It is even possible to take the Python\ncode describing a PyTorch model and run it without Python using the\nTorchScript engine (The PyTorch team, n.d.b)." (312:1-317:46, 18376-18756) |
|
├─39 heading[1] (319:1-319:34, 18758-18791) |
|
│ │ depth: 2 |
|
│ └─0 text "Separate control and data flow" (319:4-319:34, 18761-18791) |
|
├─40 paragraph[1] (321:1-326:29, 18793-19170) |
|
│ └─0 text "PyTorch maintains a strict separation between its control (i.e. program\nbranches, loops) and data flow (i.e. tensors and the operations\nperformed on them). The resolution of the control flow is handled by\nPython and optimized C++ code executed on the host CPU, and result in a\nlinear sequence of operator invocations on the device. Operators can be\nrun either on CPU or on GPU." (321:1-326:29, 18793-19170) |
|
├─41 paragraph[1] (328:1-337:24, 19172-19827) |
|
│ └─0 text "PyTorch is designed to execute operators asynchronously on GPU by\nleveraging the CUDA stream mechanism (Luitjens 2014) to queue CUDA\nkernel invocations to the GPUs hardware FIFO. This allows the system to\noverlap the execution of Python code on CPU with tensor operators on\nGPU. Because the tensor operations usually take a significant amount of\ntime, this lets us saturate the GPU and reach peak performance even in\nan interpreted language with fairly high overhead like Python. Note that\nthis mechanism is nearly invisible to the user. Unless they implement\ntheir own multi-stream primitives all of the CPU-GPU synchronization is\nhandled by the library." (328:1-337:24, 19172-19827) |
|
├─42 paragraph[1] (339:1-342:25, 19829-20054) |
|
│ └─0 text "PyTorch could leverage a similar mechanism to also execute operators\nasynchronously on the CPU. However the costs of cross-thread\ncommunication and synchronization would negate the performance benefit\nof such an optimization." (339:1-342:25, 19829-20054) |
|
├─43 heading[1] (344:1-344:35, 20056-20090) |
|
│ │ depth: 2 |
|
│ └─0 text "Custom caching tensor allocator" (344:4-344:35, 20059-20090) |
|
├─44 paragraph[3] (346:1-357:50, 20092-20903) |
|
│ ├─0 text "Almost every operator must dynamically allocate an output tensor to hold\nthe result of its execution. It is therefore critical to optimize the\nspeed of the dynamic memory allocators. PyTorch can rely on optimized\nlibraries (Berger et al. 2000; Evans May 2006; Ghemawat and Menage,\nn.d.) to handle this task on CPU. However, on GPU the " (346:1-350:55, 20092-20427) |
|
│ ├─1 inlineCode "cudaFree" (350:55-350:65, 20427-20437) |
|
│ └─2 text " routine\nmay block its caller until all previously queued work on all GPUs\ncompletes. To avoid this bottleneck, PyTorch implements a custom\nallocator which incrementally builds up a cache of CUDA memory and\nreassigns it to later allocations without further use of CUDA APIs. The\nincremental allocation is also crucial for better interoperability,\nbecause taking up all GPU memory ahead of time would prevent the user\nfrom utilizing other GPU-enabled Python packages." (350:65-357:50, 20437-20903) |
|
├─45 paragraph[1] (359:1-363:14, 20905-21204) |
|
│ └─0 text "To further improve its effectiveness, this allocator was tuned for the\nspecific memory usage patterns of deep learning. For example, it rounds\nup allocations to multiples of 512 bytes to avoid fragmentation issues.\nMoreover, it maintains a distinct pool of memory for every CUDA stream\n(work queue)." (359:1-363:14, 20905-21204) |
|
├─46 paragraph[3] (365:1-373:60, 21206-21825) |
|
│ ├─0 text "The one-pool-per-stream design assumption simplifies the implementation\nand improves the performance of the allocator: because the CPU runs\nahead of the GPU, memory is freed on the CPU " (365:1-367:46, 21206-21391) |
|
│ ├─1 emphasis[1] (367:46-367:54, 21391-21399) |
|
│ │ └─0 text "before" (367:47-367:53, 21392-21398) |
|
│ └─2 text " its last use on\nthe GPU finishes. Since streams serialize execution, if the free\nprecedes the reallocation on the CPU, the same order will occur on the\nGPU. So the allocator can reallocate memory freed on the CPU immediately\nas long as the new allocation is used on the same stream as the freed\nregion. However, if an allocation was last used on one stream and then\nallocated on another, additional synchronization is needed." (367:54-373:60, 21399-21825) |
|
├─47 paragraph[1] (375:1-383:33, 21827-22418) |
|
│ └─0 text "The one-pool-per-stream design seems limiting since the allocations end\nup fragmented per stream, but in practice PyTorch almost never uses\nmultiple streams. It is notoriously hard to write CUDA kernels in a way\nthat would let them cooperatively share the GPU because exact scheduling\nis hardware controlled. In practice, kernel writers usually resort to\nmonolithic kernels that combine multiple tasks. Data loading and\ndistributed computing utilities are exceptions to the one stream design,\nand they carefully insert additional synchronization to avoid bad\ninteractions with the allocator." (375:1-383:33, 21827-22418) |
|
├─48 paragraph[1] (385:1-387:32, 22420-22590) |
|
│ └─0 text "While this design is susceptible to certain corner cases, it almost\nnever exhibits unwanted behaviors in practical code. Most of our users\nare not aware of its existence." (385:1-387:32, 22420-22590) |
|
├─49 heading[1] (389:1-389:19, 22592-22610) |
|
│ │ depth: 2 |
|
│ └─0 text "Multiprocessing" (389:4-389:19, 22595-22610) |
|
├─50 paragraph[3] (391:1-396:26, 22612-22985) |
|
│ ├─0 text "Due to the global interpreter lock (GIL) Python's default implementation\ndoes not allow concurrent threads to execute in parallel. To alleviate\nthis problem, the Python community has established a standard\n" (391:1-394:1, 22612-22818) |
|
│ ├─1 inlineCode "multiprocessing" (394:1-394:18, 22818-22835) |
|
│ └─2 text " module, containing a number of utilities that allow\nusers to easily spawn child processes and implement basic inter-process\ncommunication primitives." (394:18-396:26, 22835-22985) |
|
├─51 paragraph[5] (398:1-404:43, 22987-23435) |
|
│ ├─0 text "However, the implementation of the primitives uses the same form of\nserialization used for on-disk persistence, which is inefficient when\ndealing with large arrays. Hence, PyTorch extends the Python\n" (398:1-401:1, 22987-23186) |
|
│ ├─1 inlineCode "multiprocessing" (401:1-401:18, 23186-23203) |
|
│ ├─2 text " module into " (401:18-401:31, 23203-23216) |
|
│ ├─3 inlineCode "torch.multiprocessing" (401:31-401:54, 23216-23239) |
|
│ └─4 text ", which is a\ndrop-in replacement for the built in package and automatically moves the\ndata of tensors sent to other processes to shared memory instead of\nsending it over the communication channel." (401:54-404:43, 23239-23435) |
|
├─52 paragraph[1] (406:1-410:45, 23437-23759) |
|
│ └─0 text "This design greatly improves performance and makes the process isolation\nweaker, resulting in a programming model which more closely resembles\nregular threaded programs. Users can easily implement heavily parallel\nprograms that operate on independent GPUs but later synchronize\ngradients using all-reduce style primitives." (406:1-410:45, 23437-23759) |
|
├─53 paragraph[1] (412:1-414:29, 23761-23929) |
|
│ └─0 text "Another unique feature of this system is that it transparently handles\nsharing of CUDA tensors, making it easy to implement techniques like\nHogwild (Recht et al. 2011)." (412:1-414:29, 23761-23929) |
|
├─54 heading[1] (416:1-416:22, 23931-23952) |
|
│ │ depth: 2 |
|
│ └─0 text "Reference counting" (416:4-416:22, 23934-23952) |
|
├─55 paragraph[1] (418:1-421:69, 23954-24236) |
|
│ └─0 text "Users often design their models to utilize all memory available during\ntraining, and increasing batch sizes is a common technique of speeding\nup the process. Therefore, to deliver great performance, PyTorch has to\ntreat memory as a scarce resource that it needs to manage carefully." (418:1-421:69, 23954-24236) |
|
├─56 paragraph[1] (423:1-433:50, 24238-24993) |
|
│ └─0 text "Libraries with eager semantics have to manage tensor memory without\nknowing how it will be used in the future. Garbage collection is the\ntypical way to handle this automatically because it has good amortized\nperformance. In this approach, the runtime periodically investigates the\nstate of the system, enumerates used objects and frees everything else.\nHowever, by deferring the deallocation, it causes the program to use\nmore memory overall (Hertz and Berger 2005). Given the scarcity of GPU\nmemory, these overheads are unacceptable. In fact, Torch7 utilized the\ngarbage collector built into Lua, and a common anti-pattern among the\nusers was to sprinkle the program with explicit triggers to the garbage\ncollector, hoping that the memory errors go away." (423:1-433:50, 24238-24993) |
|
├─57 paragraph[5] (435:1-441:50, 24995-25464) |
|
│ ├─0 text "PyTorch takes a different approach: it relies on a reference counting\nscheme to track the number of uses of each tensor, and frees the\nunderlying memory " (435:1-437:19, 24995-25148) |
|
│ ├─1 emphasis[1] (437:19-437:32, 25148-25161) |
|
│ │ └─0 text "immediately" (437:20-437:31, 25149-25160) |
|
│ ├─2 text " once this count reaches zero. Note that\nPyTorch tracks both references internal to the " (437:32-438:48, 25161-25249) |
|
│ ├─3 inlineCode "libtorch" (438:48-438:58, 25249-25259) |
|
│ └─4 text " library and\nexternal references made by users in their Python code by integrating\nwith Python's own reference counting mechanism. This ensures that memory\nis released exactly when tensors become unneeded." (438:58-441:50, 25259-25464) |
|
├─58 paragraph[1] (443:1-449:69, 25466-25949) |
|
│ └─0 text "One notable caveat is that we can only guarantee the desired performance\ncharacteristics in implementations of languages that either already\nutilize reference counting (CPython, Swift, but not PyPy or many\nscripting languages such as Lua), and those that allow for user-defined\nbehavior for assignment, copies, and moves (e.g. C++, Rust). Bindings to\nimplementations that do not satisfy those criteria will have to\nimplement their own specialized memory management on top of PyTorch." (443:1-449:69, 25466-25949) |
|
├─59 heading[1] (451:1-451:13, 25951-25963) |
|
│ │ depth: 1 |
|
│ └─0 text "Evaluation" (451:3-451:13, 25953-25963) |
|
├─60 paragraph[1] (453:1-457:25, 25965-26268) |
|
│ └─0 text "In this section we compare the performance of PyTorch with several other\ncommonly-used deep learning libraries, and find that it achieves\ncompetitive performance across a range of tasks. All experiments were\nperformed on a workstation with two Intel Xeon E5-2698 v4 CPUs and one\nNVIDIA Quadro GP100 GPU." (453:1-457:25, 25965-26268) |
|
├─61 heading[1] (459:1-459:25, 26270-26294) |
|
│ │ depth: 2 |
|
│ └─0 text "Asynchronous dataflow" (459:4-459:25, 26273-26294) |
|
├─62 paragraph[1] (461:1-464:27, 26296-26539) |
|
│ └─0 text "We start by quantifying the ability of PyTorch to asynchronously execute\ndataflow on GPU. We use the built-in profiler (The PyTorch team, n.d.a)\nto instrument various benchmarks and record a timeline of the execution\nof a single training step." (461:1-464:27, 26296-26539) |
|
├─63 paragraph[1] (466:1-476:56, 26541-27221)
|
│ └─0 text "The figure below shows a representative timeline of execution for the\nfirst few operations of a ResNet-50 model. The host CPU, which queues\nthe work, quickly outpaces the execution of the operators on the GPU.\nThis allows PyTorch to achieve almost perfect device utilization. In\nthis example, GPU execution takes around three times longer than CPU\nscheduling. The exact ratio depends on the relative performance of the\nhost CPU and the GPU, as well as the number of elements in each tensor\nand the average arithmetic complexity of the floating point computations\nto be performed on the GPU." (466:1-476:56, 26541-27221)
|
├─64 paragraph[1] (478:1-480:4, 27223-27297)
|
│ └─0 image (479:1-479:36, 27237-27272)
|
│ │ title: null
|
│ │ url: "async_kernel_launches.pdf"
|
│ │ alt: "Timeline of a training step: the host CPU queues kernels well ahead of their asynchronous execution on the GPU"
|
├─65 heading[1] (482:1-482:21, 27299-27319) |
|
│ │ depth: 2 |
|
│ └─0 text "Memory management" (482:4-482:21, 27302-27319) |
|
├─66 paragraph[1] (484:1-494:70, 27321-28091)
|
│ └─0 text "We used the NVIDIA profiler to trace the execution of the CUDA runtime\nas well as the execution of the CUDA kernels launched during one\ntraining iteration of the ResNet-50 model. As shown in the figure below,\nthe behavior of the first iteration differs significantly from that of\nsubsequent ones. At first, calls to the CUDA memory management functions\n(`cudaMalloc` and `cudaFree`) slow down the execution quite dramatically\nby blocking the CPU thread for long periods of time, hence lowering the\nutilization of the GPU. This effect disappears in subsequent iterations\nas the PyTorch caching memory allocator starts reusing previously\nallocated regions." (484:1-494:70, 27321-28091)
|
├─67 paragraph[1] (496:1-498:4, 28093-28171)
|
│ └─0 image (497:1-497:40, 28107-28146)
|
│ │ title: null
|
│ │ url: "resnet50_annotated_traces.pdf"
|
│ │ alt: "Annotated traces of CUDA runtime calls and kernel launches during the first two training iterations of ResNet-50"
|
├─68 heading[1] (500:1-500:14, 28173-28186) |
|
│ │ depth: 2 |
|
│ └─0 text "Benchmarks" (500:4-500:14, 28176-28186) |
|
├─69 paragraph[1] (502:1-506:66, 28188-28528) |
|
│ └─0 text "Finally, we can get an overall sense of single-machine eager mode\nperformance of PyTorch by comparing it to three popular graph-based deep\nlearning frameworks (CNTK, MXNet and TensorFlow), a define-by-run\nframework (Chainer), and production oriented platform (PaddlePaddle).\nThe Appendix details all the steps needed to reproduce our setup." (502:1-506:66, 28188-28528) |
|
├─70 paragraph[1] (508:1-514:11, 28530-28892)
|
│ └─0 text "Our results are summarized in the table below. On all the benchmarks,\nthe performance of PyTorch is within 17% of that of the fastest\nframework. We attribute this result to the fact that these tools offload\nmost of the computation to the same version of the cuDNN and cuBLAS\nlibraries." (508:1-514:11, 28530-28892)
|
├─71 paragraph[1] (516:1-526:179, 28894-30712)
|
│ └─0 text "::: {#detailed_perf_results}\n| Framework | AlexNet | VGG-19 | ResNet-50 | MobileNet | GNMTv2 | NCF |\n|:-------------|:----------------------:|:--------------------:|:--------------------:|:---------------------:|:--------------------------:|:--------------------------:|\n| Chainer | $778 \\pm 15$ | N/A | $\\textbf{219} \\pm 1$ | N/A | N/A | N/A |\n| CNTK | $845 \\pm 8$ | $84 \\pm 3$ | $210 \\pm 1$ | N/A | N/A | N/A |\n| MXNet | $\\textbf{1554} \\pm 22$ | $113 \\pm 1$ | $218 \\pm 2$ | $444 \\pm 2$ | N/A | N/A |\n| PaddlePaddle | $933 \\pm 123$ | $112 \\pm 2$ | $192 \\pm 4$ | $\\textbf{557} \\pm 24$ | N/A | N/A |\n| TensorFlow | $1422 \\pm 27$ | $66 \\pm 2$ | $200 \\pm 1$ | $216 \\pm 15$ | $9631 \\pm 1.3\\%$ | $4.8e6 \\pm 2.9\\%$ |\n| PyTorch | $1547 \\pm 316$ | $\\textbf{119} \\pm 1$ | $212 \\pm 2$ | $463 \\pm 17$ | $\\textbf{15512} \\pm 4.8\\%$ | $\\textbf{5.4e6} \\pm 3.4\\%$ |" (516:1-526:179, 28894-30712)
|
├─72 paragraph[1] (528:1-533:4, 30714-31006) |
|
│ └─0 text "Training speed for 6 models using 32bit floats. Throughput is measured\nin images per second for the AlexNet, VGG-19, ResNet-50, and MobileNet\nmodels, in tokens per second for the GNMTv2 model, and in samples per\nsecond for the NCF model. The fastest speed for each model is shown in\nbold.\n:::" (528:1-533:4, 30714-31006) |
|
├─73 heading[1] (535:1-535:12, 31008-31019) |
|
│ │ depth: 2 |
|
│ └─0 text "Adoption" (535:4-535:12, 31011-31019) |
|
├─74 paragraph[1] (537:1-548:34, 31021-31811)
|
│ └─0 text "The validity of design decisions and their impact on ease-of-use is hard\nto measure. As a proxy, we tried to quantify how well the machine\nlearning community received PyTorch by counting how often various\nmachine learning tools (including Caffe, Chainer, CNTK, Keras, MXNet,\nPyTorch, TensorFlow, and Theano) are mentioned on arXiv e-Prints since\nthe initial release of PyTorch in January 2017. In the figure below we\nreport the monthly number of mentions of the word \"PyTorch\" as a\npercentage of all mentions among these deep learning frameworks. We\ncounted tools mentioned multiple times in a given paper only once, and\nmade the search case insensitive to account for various spellings." (537:1-548:34, 31021-31811)
|
├─75 paragraph[1] (550:1-552:4, 31813-31880)
|
│ └─0 image (551:1-551:29, 31827-31855)
|
│ │ title: null
|
│ │ url: "arxiv_mentions.pdf"
|
│ │ alt: "Monthly arXiv mentions of PyTorch as a percentage of mentions of all major deep learning frameworks since January 2017"
|
├─76 heading[1] (554:1-554:29, 31882-31910) |
|
│ │ depth: 1 |
|
│ └─0 text "Conclusion and future work" (554:3-554:29, 31884-31910) |
|
├─77 paragraph[1] (556:1-566:17, 31912-32615) |
|
│ └─0 text "PyTorch has become a popular tool in the deep learning research\ncommunity by combining a focus on usability with careful performance\nconsiderations. In addition to continuing to support the latest trends\nand advances in deep learning, in the future we plan to continue to\nimprove the speed and scalability of PyTorch. Most notably, we are\nworking on the PyTorch JIT: a suite of tools that allow PyTorch programs\nto be executed outside of the Python interpreter where they can be\nfurther optimized. We also intend to improve support for distributed\ncomputation by providing efficient primitives for data parallelism as\nwell as a Pythonic library for model parallelism based around remote\nprocedure calls." (556:1-566:17, 31912-32615) |
|
├─78 heading[1] (568:1-568:19, 32617-32635) |
|
│ │ depth: 1 |
|
│ └─0 text "Acknowledgements" (568:3-568:19, 32619-32635) |
|
├─79 paragraph[1] (570:1-587:57, 32637-33864) |
|
│ └─0 text "We are grateful to the PyTorch community for their feedback and\ncontributions that greatly influenced the design and implementation of\nPyTorch. We thank all the PyTorch core team members, contributors and\npackage maintainers including Ailing Zhang, Alex Suhan, Alfredo Mendoza,\nAlican Bozkurt, Andrew Tulloch, Ansha Yu, Anthony Shoumikhin, Bram\nWasti, Brian Vaughan, Christian Puhrsch, David Reiss, David Riazati,\nDavide Libenzi, Dmytro Dzhulgakov, Dwaraj Rajagopal, Edward Yang, Elias\nEllison, Fritz Obermeyer, George Zhang, Hao Lu, Hong Xu, Hung Duong,\nIgor Fedan, Ilia Cherniavskii, Iurii Zdebskyi, Ivan Kobzarev, James\nReed, Jeff Smith, Jerry Chen, Jerry Zhang, Jiakai Liu, Johannes M.\nDieterich, Karl Ostmo, Lin Qiao, Martin Yuan, Michael Suo, Mike Ruberry,\nMikhail Zolothukhin, Mingzhe Li, Neeraj Pradhan, Nick Korovaiko, Owen\nAnderson, Pavel Belevich, Peter Johnson, Pritam Damania, Raghuraman\nKrishnamoorthi, Richard Zou, Roy Li, Rui Zhu, Sebastian Messmer, Shen\nLi, Simon Wang, Supriya Rao, Tao Xu, Thomas Viehmann, Vincent\nQuenneville-Belair, Vishwak Srinivasan, Vitaly Fedyunin, Wanchao Liang,\nWei Yang, Will Feng, Xiaomeng Yang, Xiaoqiang Zheng, Xintao Chen,\nYangqing Jia, Yanli Zhao, Yinghai Lu and Zafar Takhirov." (570:1-587:57, 32637-33864) |
|
├─80 paragraph[3] (589:1-595:4, 33866-34165) |
|
│ ├─0 text "::: {#refs .references .csl-bib-body .hanging-indent}\n::: {#ref-TF .csl-entry}\nAbadi, Martı́n, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,\nCraig Citro, Greg S. Corrado, et al. 2015. \"TensorFlow: Large-Scale\nMachine Learning on Heterogeneous Systems.\"\n" (589:1-594:1, 33866-34131) |
|
│ ├─1 link[1] (594:1-594:30, 34131-34160) |
|
│ │ │ title: null |
|
│ │ │ url: "https://www.tensorflow.org/" |
|
│ │ └─0 text "https://www.tensorflow.org/" (594:2-594:29, 34132-34159) |
|
│ └─2 text ".\n:::" (594:30-595:4, 34160-34165) |
|
├─81 paragraph[1] (597:1-600:4, 34167-34271) |
|
│ └─0 text "::: {#ref-APL .csl-entry}\nAbrams, Philip S. 1970. \"An APL Machine.\" PhD thesis, Stanford\nUniversity.\n:::" (597:1-600:4, 34167-34271) |
|
├─82 paragraph[5] (602:1-605:4, 34273-34402) |
|
│ ├─0 text "::: {#ref-jax .csl-entry}\nal., Matthew Johnson et. 2018. \"Jax.\" " (602:1-603:39, 34273-34337) |
|
│ ├─1 emphasis[1] (603:39-603:58, 34337-34356) |
|
│ │ └─0 text "GitHub Repository" (603:40-603:57, 34338-34355) |
|
│ ├─2 text ".\n" (603:58-604:1, 34356-34358) |
|
│ ├─3 link[1] (604:1-604:32, 34358-34389) |
|
│ │ │ title: null |
|
│ │ │ url: "https://github.com/google/jax" |
|
│ │ └─0 text "https://github.com/google/jax" (604:2-604:31, 34359-34388) |
|
│ └─4 text "; GitHub.\n:::" (604:32-605:4, 34389-34402) |
|
├─83 paragraph[5] (607:1-610:4, 34404-34537) |
|
│ ├─0 text "::: {#ref-flux .csl-entry}\nal., Mike Innes et. 2018. \"Flux.jl.\" " (607:1-608:38, 34404-34468) |
|
│ ├─1 emphasis[1] (608:38-608:57, 34468-34487) |
|
│ │ └─0 text "GitHub Repository" (608:39-608:56, 34469-34486) |
|
│ ├─2 text ".\n" (608:57-609:1, 34487-34489) |
|
│ ├─3 link[1] (609:1-609:36, 34489-34524) |
|
│ │ │ title: null |
|
│ │ │ url: "https://github.com/FluxML/Flux.jl" |
|
│ │ └─0 text "https://github.com/FluxML/Flux.jl" (609:2-609:35, 34490-34523) |
|
│ └─4 text "; GitHub.\n:::" (609:36-610:4, 34524-34537) |
|
├─84 paragraph[5] (612:1-617:4, 34539-34837) |
|
│ ├─0 text "::: {#ref-autodiff_survey .csl-entry}\nBaydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul,\nand Jeffrey Mark Siskind. 2017. \"Automatic Differentiation in Machine\nLearning: A Survey.\" " (612:1-615:22, 34539-34738) |
|
│ ├─1 emphasis[1] (615:22-615:44, 34738-34760) |
|
│ │ └─0 text "J. Mach. Learn. Res." (615:23-615:43, 34739-34759) |
|
│ ├─2 text " 18 (1): 5595--5637.\n" (615:44-616:1, 34760-34781) |
|
│ ├─3 link[1] (616:1-616:52, 34781-34832) |
|
│ │ │ title: null |
|
│ │ │ url: "http://dl.acm.org/citation.cfm?id=3122009.3242010" |
|
│ │ └─0 text "http://dl.acm.org/citation.cfm?id=3122009.3242010" (616:2-616:51, 34782-34831) |
|
│ └─4 text ".\n:::" (616:52-617:4, 34832-34837) |
|
├─85 paragraph[5] (619:1-626:4, 34839-35237) |
|
│ ├─0 text "::: {#ref-hoard .csl-entry}\nBerger, Emery D., Kathryn S. McKinley, Robert D. Blumofe, and Paul R.\nWilson. 2000. \"Hoard: A Scalable Memory Allocator for Multithreaded\nApplications.\" In " (619:1-622:19, 34839-35023) |
|
│ ├─1 emphasis[1] (622:19-623:71, 35023-35147) |
|
│ │ └─0 text "Proceedings of the Ninth International Conference on\nArchitectural Support for Programming Languages and Operating Systems" (622:20-623:70, 35024-35146) |
|
│ ├─2 text ",\n117--28. ASPLOS IX. New York, NY, USA: ACM.\n" (623:71-625:1, 35147-35193) |
|
│ ├─3 link[1] (625:1-625:40, 35193-35232) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1145/378993.379232" |
|
│ │ └─0 text "https://doi.org/10.1145/378993.379232" (625:2-625:39, 35194-35231) |
|
│ └─4 text ".\n:::" (625:40-626:4, 35232-35237) |
|
├─86 paragraph[5] (628:1-632:4, 35239-35459) |
|
│ ├─0 text "::: {#ref-Julia .csl-entry}\nBezanson, Jeff, Alan Edelman, Stefan Karpinski, and Viral B Shah. 2017.\n\"Julia: A Fresh Approach to Numerical Computing.\" " (628:1-630:51, 35239-35389) |
|
│ ├─1 emphasis[1] (630:51-630:64, 35389-35402) |
|
│ │ └─0 text "SIAM Review" (630:52-630:63, 35390-35401) |
|
│ ├─2 text " 59 (1):\n65--98. " (630:64-631:9, 35402-35419) |
|
│ ├─3 link[1] (631:9-631:44, 35419-35454) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1137/141000671" |
|
│ │ └─0 text "https://doi.org/10.1137/141000671" (631:10-631:43, 35420-35453) |
|
│ └─4 text ".\n:::" (631:44-632:4, 35454-35459) |
|
├─87 paragraph[3] (634:1-638:4, 35461-35691) |
|
│ ├─0 text "::: {#ref-cudnn .csl-entry}\nChetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan D.\nCohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. \"cuDNN:\nEfficient Primitives for Deep Learning.\" " (634:1-637:42, 35461-35666) |
|
│ ├─1 emphasis[1] (637:42-637:48, 35666-35672) |
|
│ │ └─0 text "CoRR" (637:43-637:47, 35667-35671) |
|
│ └─2 text " abs/1410.0759.\n:::" (637:48-638:4, 35672-35691) |
|
├─88 paragraph[1] (640:1-643:4, 35693-35844) |
|
│ └─0 text "::: {#ref-Torch .csl-entry}\nCollobert, Ronan, Samy Bengio, and Johnny Mariéthoz. 2002. \"Torch: A\nModular Machine Learning Software Library.\" Idiap.\n:::" (640:1-643:4, 35693-35844) |
|
├─89 paragraph[3] (645:1-648:4, 35846-36016) |
|
│ ├─0 text "::: {#ref-Torch7 .csl-entry}\nCollobert, Ronan, Koray Kavukcuoglu, and Clément Farabet. 2011. \"Torch7:\nA Matlab-Like Environment for Machine Learning.\" In " (645:1-647:53, 35846-36000) |
|
│ ├─1 emphasis[1] (647:53-647:64, 36000-36011) |
|
│ │ └─0 text "NIPS 2011" (647:54-647:63, 36001-36010) |
|
│ └─2 text ".\n:::" (647:64-648:4, 36011-36016) |
|
├─90 paragraph[1] (650:1-652:4, 36018-36104) |
|
│ └─0 text "::: {#ref-dlpack .csl-entry}\nDMLC. n.d. \"DLPack: Open in Memory Tensor Structure.\"\n:::" (650:1-652:4, 36018-36104) |
|
├─91 paragraph[5] (654:1-659:4, 36106-36357) |
|
│ ├─0 text "::: {#ref-jemalloc .csl-entry}\nEvans, J. May 2006. \"A Scalable Concurrent Malloc(3) Implementation for\nFreeBSD.\" In " (654:1-656:14, 36106-36222) |
|
│ ├─1 emphasis[1] (656:14-656:58, 36222-36266) |
|
│ │ └─0 text "In BSDCan --- the Technical BSD Conference" (656:15-656:57, 36223-36265) |
|
│ ├─2 text ". Ottawa,\nCanada.\n" (656:58-658:1, 36266-36284) |
|
│ ├─3 link[1] (658:1-658:69, 36284-36352) |
|
│ │ │ title: null |
|
│ │ │ url: "http://people.freebsd.org/˜jasone/jemalloc/bsdcan2006/jemalloc.pdf" |
|
│ │ └─0 text "http://people.freebsd.org/˜jasone/jemalloc/bsdcan2006/jemalloc.pdf" (658:2-658:68, 36285-36351) |
|
│ └─4 text ".\n:::" (658:69-659:4, 36352-36357) |
|
├─92 paragraph[1] (661:1-663:4, 36359-36454) |
|
│ └─0 text "::: {#ref-worse_is_better .csl-entry}\nGabriel, Richard. n.d. \"The Rise of Worse Is Better.\"\n:::" (661:1-663:4, 36359-36454) |
|
├─93 paragraph[3] (665:1-668:4, 36456-36618) |
|
│ ├─0 text "::: {#ref-tcmalloc .csl-entry}\nGhemawat, S., and P. Menage. n.d. \"Tcmalloc: Thread-Caching Malloc.\"\n" (665:1-667:1, 36456-36556) |
|
│ ├─1 link[1] (667:1-667:58, 36556-36613) |
|
│ │ │ title: null |
|
│ │ │ url: "http://goog-perftools.sourceforge.net/doc/tcmalloc.html" |
|
│ │ └─0 text "http://goog-perftools.sourceforge.net/doc/tcmalloc.html" (667:2-667:57, 36557-36612) |
|
│ └─2 text ".\n:::" (667:58-668:4, 36613-36618) |
|
├─94 paragraph[1] (670:1-673:4, 36620-36739) |
|
│ └─0 text "::: {#ref-eigenweb .csl-entry}\nGuennebaud, Gaël, Benoît Jacob, et al. 2010. \"Eigen V3.\"\nhttp://eigen.tuxfamily.org.\n:::" (670:1-673:4, 36620-36739) |
|
├─95 paragraph[5] (675:1-681:4, 36741-37129) |
|
│ ├─0 text "::: {#ref-garbage_collection .csl-entry}\nHertz, Matthew, and Emery D. Berger. 2005. \"Quantifying the Performance\nof Garbage Collection Vs. Explicit Memory Management.\" In " (675:1-677:59, 36741-36912) |
|
│ ├─1 emphasis[1] (677:59-679:51, 36912-37036) |
|
│ │ └─0 text "Proceedings\nof the 20th Annual ACM SIGPLAN Conference on Object-Oriented\nProgramming, Systems, Languages, and Applications" (677:60-679:50, 36913-37035) |
|
│ ├─2 text ", 313--26. OOPSLA '05.\nNew York, NY, USA: ACM. " (679:51-680:25, 37036-37083) |
|
│ ├─3 link[1] (680:25-680:66, 37083-37124) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1145/1094811.1094836" |
|
│ │ └─0 text "https://doi.org/10.1145/1094811.1094836" (680:26-680:65, 37084-37123) |
|
│ └─4 text ".\n:::" (680:66-681:4, 37124-37129) |
|
├─96 paragraph[1] (683:1-685:4, 37131-37232) |
|
│ └─0 text "::: {#ref-hasktorch .csl-entry}\nHuang, Austin, Junji Hashimoto, and Sam Stites. n.d. \"HaskTorch.\"\n:::" (683:1-685:4, 37131-37232) |
|
├─97 paragraph[3] (687:1-692:4, 37234-37525)
|
│ ├─0 text "::: {#ref-Caffe .csl-entry}\nJia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan\nLong, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014.\n\"Caffe: Convolutional Architecture for Fast Feature Embedding.\"\n" (687:1-691:1, 37234-37474)
|
│ ├─1 emphasis[1] (691:1-691:37, 37474-37510)
|
│ │ └─0 text "arXiv Preprint arXiv:1408.5093" (691:2-691:36, 37475-37509)
|
│ └─2 text ".\n:::" (691:37-692:4, 37510-37525)
|
├─98 paragraph[1] (694:1-697:4, 37527-37670) |
|
│ └─0 text "::: {#ref-SciPy .csl-entry}\nJones, Eric, Travis Oliphant, Pearu Peterson, et al. 2001--. \"SciPy:\nOpen Source Scientific Tools for Python.\"\n:::" (694:1-697:4, 37527-37670) |
|
├─99 paragraph[1] (699:1-702:4, 37672-37804) |
|
│ └─0 text "::: {#ref-maxdnn .csl-entry}\nLavin, Andrew. 2015. \"maxDNN: An Efficient Convolution Kernel for Deep\nLearning with Maxwell GPUs.\"\n:::" (699:1-702:4, 37672-37804) |
|
├─100 paragraph[3] (704:1-708:4, 37806-38014) |
|
│ ├─0 text "::: {#ref-fast_cnn .csl-entry}\nLavin, Andrew, and Scott Gray. 2016. \"Fast Algorithms for Convolutional\nNeural Networks.\" " (704:1-706:19, 37806-37927) |
|
│ ├─1 emphasis[1] (706:19-707:20, 37927-37999) |
|
│ │ └─0 text "2016 IEEE Conference on Computer Vision and Pattern\nRecognition (CVPR)" (706:20-707:19, 37928-37998) |
|
│ └─2 text ", 4013--21.\n:::" (707:20-708:4, 37999-38014) |
|
├─101 paragraph[3] (710:1-714:4, 38016-38193) |
|
│ ├─0 text "::: {#ref-mnist .csl-entry}\nLeCun, Yann, and Corinna Cortes. n.d. \"MNIST Handwritten Digit\nDatabase.\" http://yann.lecun.com/exdb/mnist/.\n" (710:1-713:1, 38016-38153) |
|
│ ├─1 link[1] (713:1-713:36, 38153-38188) |
|
│ │ │ title: null |
|
│ │ │ url: "http://yann.lecun.com/exdb/mnist/" |
|
│ │ └─0 text "http://yann.lecun.com/exdb/mnist/" (713:2-713:35, 38154-38187) |
|
│ └─2 text ".\n:::" (713:36-714:4, 38188-38193) |
|
├─102 paragraph[1] (716:1-719:4, 38195-38327) |
|
│ └─0 text "::: {#ref-Lush .csl-entry}\nLeCun, Y, and L Bottou. 2002. \"Lush Reference Manual.\" code available at\nhttp://lush.sourceforge.net.\n:::" (716:1-719:4, 38195-38327) |
|
├─103 paragraph[5] (721:1-727:4, 38329-38691) |
|
│ ├─0 text "::: {#ref-Leuck-dual-numbers .csl-entry}\nLeuck, Holger, and Hans-Hellmut Nagel. 1999. \"Automatic Differentiation\nFacilitates OF-Integration into Steering-Angle-Based Road Vehicle\nTracking.\" In " (721:1-724:15, 38329-38522) |
|
│ ├─1 emphasis[1] (724:15-725:63, 38522-38632) |
|
│ │ └─0 text "1999 Conference on Computer Vision and Pattern\nRecognition (CVPR '99), 23-25 June 1999, Ft. Collins, CO, USA" (724:16-725:62, 38523-38631) |
|
│ ├─2 text ",\n2360--65. " (725:63-726:11, 38632-38644) |
|
│ ├─3 link[1] (726:11-726:53, 38644-38686) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1109/CVPR.1999.784659" |
|
│ │ └─0 text "https://doi.org/10.1109/CVPR.1999.784659" (726:12-726:52, 38645-38685) |
|
│ └─4 text ".\n:::" (726:53-727:4, 38686-38691) |
|
├─104 paragraph[3] (729:1-732:4, 38693-38883) |
|
│ ├─0 text "::: {#ref-cuda_stream .csl-entry}\nLuitjens, Justin. 2014. \"CUDA Streams.\"\n" (729:1-731:1, 38693-38767) |
|
│ ├─1 link[1] (731:1-731:112, 38767-38878) |
|
│ │ │ title: null |
|
│ │ │ url: "http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf" |
|
│ │ └─0 text "http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf" (731:2-731:111, 38768-38877) |
|
│ └─2 text ".\n:::" (731:112-732:4, 38878-38883) |
|
├─105 paragraph[1] (734:1-737:4, 38885-39066) |
|
│ └─0 text "::: {#ref-maclaurin2016phd .csl-entry}\nMaclaurin, Dougal. 2016. \"Modeling, Inference and Optimization with\nComposable Differentiable Procedures.\" PhD thesis, Harvard University.\n:::" (734:1-737:4, 38885-39066) |
|
├─106 paragraph[3] (739:1-742:4, 39068-39196) |
|
│ ├─0 text "::: {#ref-Matlab .csl-entry}\n" (739:1-740:1, 39068-39097) |
|
│ ├─1 emphasis[1] (740:1-740:32, 39097-39128) |
|
│ │ └─0 text "MATLAB and Statistics Toolbox" (740:2-740:31, 39098-39127) |
|
│ └─2 text ". n.d. Natick, Massachusetts, United\nStates: The MathWorks, Inc.\n:::" (740:32-742:4, 39128-39196) |
|
├─107 paragraph[3] (744:1-748:4, 39198-39371) |
|
│ ├─0 text "::: {#ref-Pandas .csl-entry}\nMcKinney, Wes. 2010. \"Data Structures for Statistical Computing in\nPython.\" In " (744:1-746:13, 39198-39306) |
|
│ ├─1 emphasis[1] (746:13-747:7, 39306-39366) |
|
│ │ └─0 text "Proceedings of the 9th Python in Science Conference,\n51-56" (746:14-747:6, 39307-39365) |
|
│ └─2 text ".\n:::" (747:7-748:4, 39366-39371) |
|
├─108 paragraph[5] (750:1-755:4, 39373-39617) |
|
│ ├─0 text "::: {#ref-DyNet .csl-entry}\nNeubig, G., C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A.\nAnastasopoulos, M. Ballesteros, et al. 2017. \"DyNet: The Dynamic Neural\nNetwork Toolkit.\" " (750:1-753:19, 39373-39551) |
|
│ ├─1 emphasis[1] (753:19-753:35, 39551-39567) |
|
│ │ └─0 text "ArXiv e-Prints" (753:20-753:34, 39552-39566) |
|
│ ├─2 text ", January.\n" (753:35-754:1, 39567-39578) |
|
│ ├─3 link[1] (754:1-754:35, 39578-39612) |
|
│ │ │ title: null |
|
│ │ │ url: "https://arxiv.org/abs/1701.03980" |
|
│ │ └─0 text "https://arxiv.org/abs/1701.03980" (754:2-754:34, 39579-39611) |
|
│ └─4 text ".\n:::" (754:35-755:4, 39612-39617) |
|
├─109 paragraph[1] (757:1-760:4, 39619-39726) |
|
│ └─0 text "::: {#ref-Numpy .csl-entry}\nOliphant, Travis. 2006. \"NumPy: A Guide to NumPy.\" USA: Trelgol\nPublishing.\n:::" (757:1-760:4, 39619-39726) |
|
├─110 paragraph[3] (762:1-766:4, 39728-39982) |
|
│ ├─0 text "::: {#ref-pytorch_autodiff .csl-entry}\nPaszke, Adam, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang,\nZachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam\nLerer. 2017. \"Automatic Differentiation in PyTorch.\" In " (762:1-765:57, 39728-39962) |
|
│ ├─1 emphasis[1] (765:57-765:72, 39962-39977) |
|
│ │ └─0 text "NIPS Workshop" (765:58-765:71, 39963-39976) |
|
│ └─2 text ".\n:::" (765:72-766:4, 39977-39982) |
|
├─111 paragraph[1] (768:1-770:4, 39984-40082) |
|
│ └─0 text "::: {#ref-nimtorch .csl-entry}\nPetrantoni, Giovanni, and Jörg Wollenschläger. n.d. \"NimTorch.\"\n:::" (768:1-770:4, 39984-40082) |
|
├─112 paragraph[5] (772:1-776:4, 40084-40310) |
|
│ ├─0 text "::: {#ref-Piponi-dual-numbers .csl-entry}\nPiponi, Dan. 2004. \"Automatic Differentiation, C++ Templates, and\nPhotogrammetry.\" " (772:1-774:18, 40084-40209) |
|
│ ├─1 emphasis[1] (774:18-774:50, 40209-40241) |
|
│ │ └─0 text "J. Graphics, GPU, & Game Tools" (774:19-774:49, 40210-40240) |
|
│ ├─2 text " 9 (4): 41--55.\n" (774:50-775:1, 40241-40257) |
|
│ ├─3 link[1] (775:1-775:49, 40257-40305) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1080/10867651.2004.10504901" |
|
│ │ └─0 text "https://doi.org/10.1080/10867651.2004.10504901" (775:2-775:48, 40258-40304) |
|
│ └─4 text ".\n:::" (775:49-776:4, 40305-40310) |
|
├─113 paragraph[5] (778:1-782:4, 40312-40502) |
|
│ ├─0 text "::: {#ref-R .csl-entry}\nR Core Team. n.d. " (778:1-779:19, 40312-40354) |
|
│ ├─1 emphasis[1] (779:19-780:11, 40354-40411) |
|
│ │ └─0 text "R: A Language and Environment for Statistical\nComputing" (779:20-780:10, 40355-40410) |
|
│ ├─2 text ". Vienna, Austria: R Foundation for Statistical Computing.\n" (780:11-781:1, 40411-40470) |
|
│ ├─3 link[1] (781:1-781:28, 40470-40497) |
|
│ │ │ title: null |
|
│ │ │ url: "http://www.R-project.org/" |
|
│ │ └─0 text "http://www.R-project.org/" (781:2-781:27, 40471-40496) |
|
│ └─4 text ".\n:::" (781:28-782:4, 40497-40502) |
|
├─114 paragraph[5] (784:1-792:4, 40504-41004) |
|
│ ├─0 text "::: {#ref-Hogwild .csl-entry}\nRecht, Benjamin, Christopher Ré, Stephen J. Wright, and Feng Niu. 2011.\n\"Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient\nDescent.\" In " (784:1-787:14, 40504-40687) |
|
│ ├─1 emphasis[1] (787:14-789:68, 40687-40879) |
|
│ │ └─0 text "Advances in Neural Information Processing Systems 24: 25th\nAnnual Conference on Neural Information Processing Systems 2011.\nProceedings of a Meeting Held 12-14 December 2011, Granada, Spain." (787:15-789:67, 40688-40878) |
|
│ ├─2 text ",\n693--701.\n" (789:68-791:1, 40879-40891) |
|
│ ├─3 link[1] (791:1-791:109, 40891-40999) |
|
│ │ │ title: null |
|
│ │ │ url: "http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent" |
|
│ │ └─0 text "http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent" (791:2-791:108, 40892-40998) |
|
│ └─4 text ".\n:::" (791:109-792:4, 40999-41004) |
|
├─115 paragraph[5] (794:1-800:4, 41006-41320) |
|
│ ├─0 text "::: {#ref-CNTK .csl-entry}\nSeide, Frank, and Amit Agarwal. 2016. \"CNTK: Microsoft's Open-Source\nDeep-Learning Toolkit.\" In " (794:1-796:28, 41006-41129) |
|
│ ├─1 emphasis[1] (796:28-797:65, 41129-41229) |
|
│ │ └─0 text "Proceedings of the 22Nd ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining" (796:29-797:64, 41130-41228) |
|
│ ├─2 text ",\n2135--35. KDD '16. New York, NY, USA: ACM.\n" (797:65-799:1, 41229-41274) |
|
│ ├─3 link[1] (799:1-799:42, 41274-41315) |
|
│ │ │ title: null |
|
│ │ │ url: "https://doi.org/10.1145/2939672.2945397" |
|
│ │ └─0 text "https://doi.org/10.1145/2939672.2945397" (799:2-799:41, 41275-41314) |
|
│ └─4 text ".\n:::" (799:42-800:4, 41315-41320) |
|
├─116 paragraph[3] (802:1-807:4, 41322-41566) |
|
│ ├─0 text "::: {#ref-EBLearn .csl-entry}\nSermanet, Pierre, Koray Kavukcuoglu, and Yann LeCun. 2009. \"Eblearn:\nOpen-Source Energy-Based Learning in c++.\" In " (802:1-804:47, 41322-41467) |
|
│ ├─1 emphasis[1] (804:47-805:64, 41467-41546) |
|
│ │ └─0 text "2009 21st IEEE\nInternational Conference on Tools with Artificial Intelligence" (804:48-805:63, 41468-41545) |
|
│ └─2 text ",\n693--97. IEEE.\n:::" (805:64-807:4, 41546-41566) |
|
├─117 paragraph[3] (809:1-814:4, 41568-41859) |
|
│ ├─0 text "::: {#ref-starcraft_pytorch .csl-entry}\nSynnaeve, G., Z. Lin, J. Gehring, D. Gant, V. Mella, V. Khalidov, N.\nCarion, and N. Usunier. 2018. \"Forward Modeling for Partial Observation\nStrategy Games - a Starcraft Defogger.\" In " (809:1-812:44, 41568-41792) |
|
│ ├─1 emphasis[1] (812:44-813:32, 41792-41843) |
|
│ │ └─0 text "Advances in Neural\nInformation Processing Systems" (812:45-813:31, 41793-41842) |
|
│ └─2 text ", 10761--71.\n:::" (813:32-814:4, 41843-41859) |
|
├─118 paragraph[1] (816:1-818:4, 41861-41959) |
|
│ └─0 text "::: {#ref-python_gil .csl-entry}\nteam, The Python. n.d. \"The CPython Global Interpreter Lock.\"\n:::" (816:1-818:4, 41861-41959) |
|
├─119 paragraph[3] (820:1-822:4, 41961-42059) |
|
│ ├─0 text "::: {#ref-autograd_profiler .csl-entry}\nteam, The PyTorch. n.d.a. " (820:1-821:27, 41961-42027) |
|
│ ├─1 emphasis[1] (821:27-821:54, 42027-42054) |
|
│ │ └─0 text "Pytorch Autograd Profiler" (821:28-821:53, 42028-42053) |
|
│ └─2 text ".\n:::" (821:54-822:4, 42054-42059) |
|
├─120 paragraph[3] (824:1-826:4, 42061-42132) |
|
│ ├─0 text "::: {#ref-torchscript .csl-entry}\n---------. n.d.b. " (824:1-825:19, 42061-42113) |
|
│ ├─1 emphasis[1] (825:19-825:33, 42113-42127) |
|
│ │ └─0 text "Torch Script" (825:20-825:32, 42114-42126) |
|
│ └─2 text ".\n:::" (825:33-826:4, 42127-42132) |
|
├─121 paragraph[5] (828:1-832:4, 42134-42361) |
|
│ ├─0 text "::: {#ref-Theano .csl-entry}\nTheano Development Team. 2016. \"[Theano: A Python framework for fast\ncomputation of mathematical expressions]{.nocase}.\" " (828:1-830:53, 42134-42284) |
|
│ ├─1 emphasis[1] (830:53-830:69, 42284-42300) |
|
│ │ └─0 text "arXiv e-Prints" (830:54-830:68, 42285-42299) |
|
│ ├─2 text "\nabs/1605.02688 (May). " (830:69-831:23, 42300-42323) |
|
│ ├─3 link[1] (831:23-831:56, 42323-42356) |
|
│ │ │ title: null |
|
│ │ │ url: "http://arxiv.org/abs/1605.02688" |
|
│ │ └─0 text "http://arxiv.org/abs/1605.02688" (831:24-831:55, 42324-42355) |
|
│ └─4 text ".\n:::" (831:56-832:4, 42356-42361) |
|
├─122 paragraph[5] (834:1-841:4, 42363-42752) |
|
│ ├─0 text "::: {#ref-Chainer .csl-entry}\nTokui, Seiya, Kenta Oono, Shohei Hido, and Justin Clayton. 2015.\n\"Chainer: A Next-Generation Open Source Framework for Deep Learning.\" In\n" (834:1-837:1, 42363-42531) |
|
│ ├─1 emphasis[1] (837:1-839:16, 42531-42684) |
|
│ │ └─0 text "Proceedings of Workshop on Machine Learning Systems (LearningSys) in\nthe Twenty-Ninth Annual Conference on Neural Information Processing\nSystems (NIPS)" (837:2-839:15, 42532-42683) |
|
│ ├─2 text ".\n" (839:16-840:1, 42684-42686) |
|
│ ├─3 link[1] (840:1-840:62, 42686-42747) |
|
│ │ │ title: null |
|
│ │ │ url: "http://learningsys.org/papers/LearningSys_2015_paper_33.pdf" |
|
│ │ └─0 text "http://learningsys.org/papers/LearningSys_2015_paper_33.pdf" (840:2-840:61, 42687-42746) |
|
│ └─4 text ".\n:::" (840:62-841:4, 42747-42752) |
|
└─123 paragraph[5] (843:1-849:4, 42754-43047) |
|
├─0 text "::: {#ref-starcraft2 .csl-entry}\nVinyals, Oriol, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander\nSasha Vezhnevets, Michelle Yeo, Alireza Makhzani, et al. 2017.\n\"StarCraft II: A New Challenge for Reinforcement Learning.\" " (843:1-846:61, 42754-42982) |
|
├─1 emphasis[1] (846:61-846:67, 42982-42988) |
|
│ └─0 text "CoRR" (846:62-846:66, 42983-42987) |
|
├─2 text "\nabs/1708.04782. " (846:67-847:17, 42988-43005) |
|
├─3 link[1] (847:17-847:50, 43005-43038) |
|
│ │ title: null |
|
│ │ url: "http://arxiv.org/abs/1708.04782" |
|
│ └─0 text "http://arxiv.org/abs/1708.04782" (847:18-847:49, 43006-43037) |
|
└─4 text ".\n:::\n:::" (847:50-849:4, 43038-43047) |