soumith · April 23, 2018 14:28
diff --git a/perf.txt b/perf.txt
 https://github.com/pytorch/pytorch/pull/3526 Reuse intermediate results over multiple backwards grad_inputs
 https://github.com/pytorch/pytorch/pull/3509 Optimizer: optimize transposes in variety of circumstances
 https://github.com/pytorch/pytorch/pull/3409 Add Tensor Core ops to RNNs for Volta
 https://github.com/pytorch/pytorch/pull/3370 Follow up #3211 (sparse broadcast_coalesced, reduce_add_coalesced)
 https://github.com/pytorch/pytorch/pull/3336 Prevent numerical issues with poisson_nll_loss when log_input=False
 https://github.com/pytorch/pytorch/pull/2764 [Done]parallelize elementwise operation with openmp
 https://github.com/pytorch/pytorch/pull/6110 Fix bilinear performance regression
 https://github.com/pytorch/pytorch/pull/6078 Exp, log, sin, cos vectorized
 https://github.com/pytorch/pytorch/pull/6062 Enable MKLDNN convolution forward and backward
 https://github.com/pytorch/pytorch/pull/6026 Speed up sum over a dimension
 https://github.com/pytorch/pytorch/pull/5913 Optimize unique sorting by using std::vector+sort instead of std::set
 https://github.com/pytorch/pytorch/pull/5782 fused GLU backward
 https://github.com/pytorch/pytorch/pull/5747 Save self.numel() for backward computation instead of self
 https://github.com/pytorch/pytorch/pull/5722 Add optimization to norm for common norms
 https://github.com/pytorch/pytorch/pull/5710 improve occupancy for cuda rngs
 https://github.com/pytorch/pytorch/pull/5680 implement TripletMarginLoss as a native function
 https://github.com/pytorch/pytorch/pull/5646 implement CosineEmbeddingLoss as a native function and add reduce arg
 https://github.com/pytorch/pytorch/pull/5640 Revert "implement CosineEmbeddingLoss as a native function and add reduce arg"
 https://github.com/pytorch/pytorch/pull/5447 implement CosineEmbeddingLoss as a native function and add reduce arg
 https://github.com/pytorch/pytorch/pull/5433 speed up CPU EmbeddingBag (indexSelectAdd op)
 https://github.com/pytorch/pytorch/pull/5346 Implement MarginRankingLoss as native function and add reduce=True arg to it
 https://github.com/pytorch/pytorch/pull/5279 Speed-up nn.Linear for the 3d input case
 https://github.com/pytorch/pytorch/pull/5080 Implement hinge_embedding_loss as a native function.
 https://github.com/pytorch/pytorch/pull/5064 DDP: 10% of NCCL backend perf improvements with mixed-prec support
 https://github.com/pytorch/pytorch/pull/5054 Use fast integer division algorithm to avoid division ops inside kernels.
 https://github.com/pytorch/pytorch/pull/5010 add AVX2 implementation for sigmoid function
 https://github.com/pytorch/pytorch/pull/4924 add reduce=True argument to MultiLabelMarginLoss
 https://github.com/pytorch/pytorch/pull/4870 Slightly improve DistributedDataParallel (single-GPU binding) multi-process distributed training performance
 https://github.com/pytorch/pytorch/pull/4824 parallelize vol2col and col2vol of Conv3D with CPU backend
 https://github.com/pytorch/pytorch/pull/4803 More efficient squeeze() backward in edge case
 https://github.com/pytorch/pytorch/pull/4705 adds reduce argument to BCEWithLogitsLoss interface
 https://github.com/pytorch/pytorch/pull/4312 Vectorize normal_
 https://github.com/pytorch/pytorch/pull/4231 Add reduce arg to BCELoss
 https://github.com/pytorch/pytorch/pull/4183 Allowing usage of GPU Direct within PyTorch for the Broadcast operation
 https://github.com/pytorch/pytorch/pull/4174 Rearrange dimensions for pointwise operations for better performance.
 https://github.com/pytorch/pytorch/pull/4094 Implement pin_memory() as a NativeFunction