PyTorch is, at its core, a Python library enabling GPU-accelerated tensor computation, similar to NumPy. On top of this, PyTorch provides a rich API for neural networks and other machine learning workloads.
PyTorch differentiates itself from other machine learning frameworks in that it does not use static computational graphs -- defined once, ahead of time -- like TensorFlow. Instead, PyTorch computation graphs are dynamic and defined by the run (define-by-run). This means that each invocation of a machine learning model's layers defines a new computation graph, on the fly. The creation of this graph is implicit, in the sense that the library takes care of recording the flow of data through the program and linking function calls (nodes) together (via edges) into a computation graph.
Besides a plethora of implementations of machine learning specific pointwise functions, objective functions (losses) and optimization algorithms, PyTorch also extends NumPy's API with support for automatic differentiation (autograd). Since many contemporary machine learning algorithms use gradient information to update weights, being able to compute gradients with respect to every parameter of even the most complex neural network models is very important. PyTorch has native support for this.
Fundamentally, PyTorch's execution model is much simpler than that of frameworks like TensorFlow or MXNet that employ static graphs. In a sense, PyTorch is just fast, GPU-accelerated Python with added neural network machinery. This has the advantage of much greater flexibility, since arbitrary Python code can be used to define neural networks on the fly, rather than having to pre-define a static computation graph once, ahead of time. It also aids debuggability, since running through a piece of computation code actually performs that computation, rather than just registering its operations in a graph. This means intermediary values can be inspected at any time during execution of the model. At the same time, having a pre-defined computation graph enables more extensive analysis and optimization of that graph, possibly greatly improving its final runtime performance. It is a tradeoff between flexibility and performance.
Note that PyTorch was not the first implementation of dynamic, GPU-accelerated computation. Predecessors to PyTorch include Chainer and DyNet, as well as Torch, which exposes an interface in Lua instead of Python. However, PyTorch stands out in being particularly fast and efficient on both CPU and GPU, thus marrying dynamism and flexibility with performance in ways other frameworks did not.
We will begin by diving into the core components of the PyTorch library, covering core datatypes and neural network specific functionality.
The most fundamental datatype in PyTorch is a tensor. The tensor datatype is very similar, both in importance and function, to NumPy's ndarray. Furthermore, since PyTorch aims to interoperate reasonably well with NumPy, the API of tensor also resembles that of ndarray.

PyTorch tensors can be created with the torch.Tensor constructor, which takes the tensor's dimensions as input and returns a tensor occupying an uninitialized region of memory:
import torch
x = torch.Tensor(4, 4)
In practice, one will most often want to use one of PyTorch's functions that return tensors initialized in a certain manner, such as:
- torch.rand: Returns a tensor with values initialized from a random uniform distribution,
- torch.randn: Returns a tensor with values initialized from a random normal distribution,
- torch.eye(n): Returns an n x n identity matrix,
- torch.from_numpy(ndarray): Returns a PyTorch tensor from a NumPy ndarray,
- torch.linspace(start, end, steps): Returns a 1-D tensor with steps values spaced linearly between start and end,
- torch.ones: Returns a tensor with ones everywhere,
- torch.zeros_like(other): Returns a tensor with the same shape as other and zeros everywhere,
- torch.arange(start, end, step): Returns a 1-D tensor with values filled from a range.
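For illustration, here is a minimal sketch using a few of these factory functions (the shapes and values are arbitrary):

import numpy as np
import torch

uniform = torch.rand(2, 3)                 # values drawn from a uniform distribution on [0, 1)
normal = torch.randn(2, 3)                 # values drawn from a standard normal distribution
identity = torch.eye(3)                    # 3 x 3 identity matrix
steps = torch.linspace(0, 1, 5)            # 0.00, 0.25, 0.50, 0.75, 1.00
from_np = torch.from_numpy(np.arange(6))   # tensor backed by the ndarray's memory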
Similar to NumPy's ndarray, PyTorch tensors provide a very rich API for combination with other tensors as well as in-place mutation. Also like NumPy, unary and binary operations can usually be performed via functions in the torch module, like torch.add(x, y), or directly via methods on the tensor objects, like x.add(y). For the usual suspects, operator overloads like x + y exist. Furthermore, many functions have in-place alternatives that will mutate the receiver instance rather than creating a new tensor. These functions have the same name as the out-of-place variants, but are suffixed with an underscore, e.g. x.add_(y).
A selection of operations includes:
- torch.add(x, y): Elementwise addition,
- torch.mm(x, y): Matrix multiplication,
- torch.mul(x, y): Elementwise multiplication,
- torch.exp(x): Elementwise exponential,
- torch.pow(x, power): Elementwise exponentiation,
- torch.sqrt(x): Elementwise square root,
- torch.sqrt_(x): In-place elementwise square root,
- torch.sigmoid(x): Elementwise sigmoid,
- torch.cumprod(x, dim): Cumulative product of values along a dimension,
- torch.sum(x): The sum of all values,
- torch.std(x): The standard deviation of all values,
- torch.mean(x): The mean of all values.
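As a brief sketch of the out-of-place, method and in-place forms (the tensors here are arbitrary):

import torch

x = torch.ones(2, 2)
y = torch.ones(2, 2) * 2

z = torch.add(x, y)    # new tensor; x is unchanged
w = x.add(y)           # equivalent method form, also out-of-place
x.add_(y)              # in-place: x itself now holds the sums
total = torch.sum(x)   # 12.0, the sum of all four elements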
Tensors support many of the familiar semantics of NumPy ndarrays, such as broadcasting, advanced (fancy) indexing (x[x > 5]) and element-wise relational operators (x > y). PyTorch tensors can also be converted to NumPy ndarrays directly via the torch.Tensor.numpy() function.
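For example, a short sketch of fancy indexing and NumPy conversion (the values are arbitrary):

import torch

x = torch.arange(0, 9).view(3, 3)   # 3 x 3 tensor holding 0 through 8
mask = x > 5                        # elementwise comparison mask
selected = x[x > 5]                 # 1-D tensor containing 6, 7, 8
as_numpy = x.numpy()                # ndarray sharing the tensor's memory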
Finally, since the primary improvement of PyTorch tensors over NumPy ndarrays is supposed to be GPU acceleration, there is also a torch.Tensor.cuda() function, which will copy the tensor memory onto a CUDA-capable GPU device, if one is available.
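A minimal sketch of moving work onto the GPU, guarded so it also runs on CPU-only machines:

import torch

x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)

if torch.cuda.is_available():
    x, y = x.cuda(), y.cuda()   # copy both tensors onto the default GPU

z = torch.mm(x, y)              # runs on the GPU if the tensors live there
z = z.cpu()                     # copy the result back to host memory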
At the core of most modern machine learning techniques is the calculation of gradients. This is especially true for neural networks, which use the backpropagation algorithm to update weights. For this reason, PyTorch has strong and native support for gradient computation of functions and variables defined within the framework. The technique with which gradients are computed automatically for arbitrary computations is called automatic (sometimes algorithmic) differentiation -- or autograd for short.
Frameworks that employ the static computation graph model implement automatic differentiation by analyzing the graph and adding additional computation nodes to it that compute the gradient of one value with respect to another step by step, piecing together the chain rule by linking these additional gradient nodes with edges.
PyTorch, however, does not have static computation graphs and thus does not have the luxury of adding gradient nodes after the rest of the computations have already been defined. Instead, PyTorch must record or trace the flow of values through the program as they occur, thus creating a computation graph dynamically. Once such a graph is recorded, PyTorch has all the information required to walk this computation flow backwards and calculate gradients of outputs from inputs.
The PyTorch Tensor does not have sufficient machinery to participate in automatic differentiation. For a tensor to be "recordable", it must be wrapped with torch.autograd.Variable. The Variable class provides almost the same API as Tensor, but adds to it the ability to interplay with torch.autograd.Function in order to be differentiated automatically. More precisely, a Variable records the history of operations on a Tensor.
Usage of torch.autograd.Variable is very simple. One needs only to pass it a Tensor and inform torch whether or not this variable requires recording of gradients:
x = torch.autograd.Variable(torch.ones(4, 4), requires_grad=True)
The requires_grad argument may need to be False in the case of data inputs or labels, for example, since those are usually not differentiated. However, they still need to be Variables to be usable in automatic differentiation. Note that requires_grad defaults to False, thus it must be set to True for learnable parameters.
To compute gradients and perform automatic differentiation, one calls the backward() function on a Variable. This will compute the gradient of that tensor with respect to the leaves of the computation graph (i.e. all inputs that influenced that value). These gradients are then collected in the Variable class' grad member. The original tensor data can be accessed via the Variable.data member.
Since every Variable except for inputs is the result of an operation, each Variable has an associated grad_fn, which is the torch.autograd.Function that is used to compute the backward step (except for inputs, for which grad_fn is None).
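A minimal sketch of this workflow (the shapes and the computation are arbitrary):

import torch
from torch.autograd import Variable

x = Variable(torch.ones(4, 4), requires_grad=True)
y = (x * 3).sum()    # a scalar that depends on x

y.backward()         # walk the recorded graph backwards from y
print(x.grad)        # dy/dx: a 4 x 4 tensor filled with 3
print(x.data)        # the underlying tensor data
print(y.grad_fn)     # the Function that produced y; None for leaf inputs like x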
The torch.nn module exposes neural-network specific functionality to PyTorch users. One of its most important members is torch.nn.Module, which represents a reusable block of operations and associated (trainable) parameters, most commonly used for neural network layers. Modules may contain other modules and implicitly get a backward() function for backpropagation. An example of a module is torch.nn.Linear(), which represents a linear (dense/fully-connected) layer (i.e. an affine transformation Wx + b).
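As a hedged sketch, a custom module composing two linear layers might look like this (the name TwoLayerNet and the layer sizes are arbitrary choices for illustration):

import torch
import torch.nn as nn
from torch.autograd import Variable

class TwoLayerNet(nn.Module):
    def __init__(self):
        super(TwoLayerNet, self).__init__()
        self.first = nn.Linear(10, 20)    # affine transformation Wx + b
        self.activation = nn.Sigmoid()
        self.second = nn.Linear(20, 2)

    def forward(self, x):
        return self.second(self.activation(self.first(x)))

model = TwoLayerNet()
output = model(Variable(torch.randn(4, 10)))   # a batch of four 10-dimensional samples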
During training, one will most often call backward() on the output of a module (typically a loss value) to compute gradients for its variables. Since calling backward() sets the grad member of Variables, there is also a nn.Module.zero_grad() method that will reset the grad member of all of the module's Variables to zero.
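A brief sketch of this pattern with a single linear layer (the summed output stands in for a real loss):

import torch
import torch.nn as nn
from torch.autograd import Variable

model = nn.Linear(10, 2)
output = model(Variable(torch.randn(4, 10)))
loss = output.sum()    # stand-in for a proper loss function

model.zero_grad()      # clear gradients left over from any previous step
loss.backward()        # populates .grad for the layer's weight and bias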
torch.nn also provides a number of loss functions that are naturally important to machine learning applications. Examples of loss functions include:

- torch.nn.MSELoss: a mean-squared-error loss,
- torch.nn.BCELoss: a binary-cross-entropy loss,
- torch.nn.KLDivLoss: a KL-divergence loss.

Note that in PyTorch jargon, loss functions are often called criterions.
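For instance, a mean-squared-error criterion could be applied like this (the predictions and targets are just random placeholders):

import torch
import torch.nn as nn
from torch.autograd import Variable

criterion = nn.MSELoss()
predictions = Variable(torch.randn(4, 2), requires_grad=True)
targets = Variable(torch.randn(4, 2))

loss = criterion(predictions, targets)   # a scalar Variable
loss.backward()                          # gradients w.r.t. the predictions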
After neural network building blocks (nn.Module) and loss functions, the last piece of the puzzle is an optimizer to run (a variant of) stochastic gradient descent. For this, PyTorch provides the torch.optim subpackage, which defines a number of useful optimization algorithms. A few of them are:

- torch.optim.SGD,
- torch.optim.Adam,
- torch.optim.RMSprop,
- torch.optim.LBFGS.
Each of these optimizers is constructed with a list of parameter objects, usually retrieved via the parameters() method of a nn.Module subclass, that determines which values are updated by the optimizer. Besides this parameter list, the optimizers each take a certain number of additional arguments to configure their optimization strategy, of course.
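Putting these pieces together, a single training step might look like the following sketch (the model, learning rate and momentum are arbitrary):

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable

model = nn.Linear(10, 2)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

inputs = Variable(torch.randn(4, 10))
targets = Variable(torch.randn(4, 2))

optimizer.zero_grad()                       # reset gradients from the previous step
loss = criterion(model(inputs), targets)
loss.backward()                             # compute fresh gradients
optimizer.step()                            # update the model's parameters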
For convenience, PyTorch provides a number of utilities to load, preprocess and interact with datasets. These helper classes and functions are found in the torch.utils.data module. The two major concepts here are:

- Datasets, which encapsulate a source of data,
- DataLoaders, which are responsible for loading datasets, possibly in parallel.
New datasets are created by subclassing the torch.utils.data.Dataset class and overriding the __len__ method to return the number of samples in the dataset and the __getitem__ method to access a single value at a certain index. For example, this would be a simple dataset encapsulating a range of integers:
import math
import torch.utils.data

class RangeDataset(torch.utils.data.Dataset):
    def __init__(self, start, end, step=1):
        self.start = start
        self.end = end
        self.step = step

    def __len__(self):
        # Number of samples, i.e. the upper bound for indices passed to __getitem__.
        return math.ceil((self.end - self.start) / self.step)

    def __getitem__(self, index):
        value = self.start + index * self.step
        assert value < self.end
        return value
Inside __init__ we would usually configure some paths or change the set of samples ultimately returned. In __len__, we specify the upper bound for the index with which __getitem__ may be called, and in __getitem__ we return the actual sample, which could be an image or an audio snippet.
To iterate over the dataset we could, in theory, simply have a for i in range loop and access samples via __getitem__. However, it would be much more convenient if the dataset implemented the iterator protocol itself, so we could simply loop over samples with for sample in dataset. Fortunately, this functionality is provided by the DataLoader class. A DataLoader object takes a dataset and a number of options that configure the way samples are retrieved. For example, it is possible to load samples in parallel, using multiple processes. For this, the DataLoader constructor takes a num_workers argument. Note that DataLoaders always return batches, whose size is set with the batch_size parameter. Here is a simple example:
dataset = RangeDataset(0, 10)
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=4, shuffle=True, num_workers=4, drop_last=True)

for i, batch in enumerate(data_loader):
    print(i, batch)
Here, we set batch_size to 4, so returned tensors will contain exactly four values. By passing shuffle=True, the index sequence with which data is accessed is permuted, such that individual samples will be returned in random order. We also passed drop_last=True, so that if the number of samples left for the final batch of the dataset is less than the specified batch_size, that batch is not returned. This ensures that all batches have the same number of elements, which may be an invariant that we need. Finally, we specified num_workers to be four, meaning data will be fetched in parallel by four separate processes in a process pool.
Once the DataLoader has been created, iterating over the dataset and thereby retrieving batches is simple and natural. A final interesting observation I want to share is that the DataLoader actually has some reasonably sophisticated logic to determine how to collate individual samples returned from your dataset's __getitem__ method into a batch, as returned by the DataLoader during iteration. For example, if __getitem__ returns a dictionary, the DataLoader will aggregate the values of that dictionary into a single mapping for the entire batch, using the same keys. This means that if the Dataset's __getitem__ returns dict(example=example, label=label), then the batch returned by the DataLoader will be something like dict(example=[example1, example2, ...], label=[label1, label2, ...]), i.e. unpacking the values of individual samples and re-packing them into a single key for the batch's dictionary. To override this behavior, you can pass a function argument for the collate_fn parameter to the DataLoader object.
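To make the default collation concrete, here is a hedged sketch using a hypothetical DictDataset whose __getitem__ returns a dictionary:

import torch.utils.data

class DictDataset(torch.utils.data.Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, index):
        return dict(example=float(index), label=index % 2)

loader = torch.utils.data.DataLoader(DictDataset(), batch_size=4)
for batch in loader:
    # batch is a single dictionary whose values are batched tensors,
    # e.g. batch['example'] holds four examples and batch['label'] four labels.
    print(batch['example'], batch['label'])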
Note that the torchvision package already provides a number of datasets, such as torchvision.datasets.CIFAR10, ready to use. The same is true for the torchaudio and torchtext packages.
Now that we have covered the basic components of PyTorch, as well as functionality related to loading data, we are ready to use PyTorch to implement a full-fledged deep learning model. We will begin with an image classifier, and then move on to a deep convolutional GAN (DCGAN) to generate synthetic images.