Convolution in deep learning works by applying a kernel (a small matrix) to a larger input matrix. You slide the kernel across the input from the top left to the bottom right, moving by the stride length at each step. At every position you perform an element-wise multiplication between the kernel and the patch of the input it covers, then sum all the products into a single number. That number becomes one entry of the output matrix.
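Here's a minimal NumPy sketch of that sliding-window operation (no padding or channels; the function name `cross_correlate2d` is just illustrative):

```python
import numpy as np

def cross_correlate2d(inp, kernel, stride=1):
    """Slide `kernel` over `inp` from top left to bottom right,
    multiplying element-wise at each position and summing the
    products into one output entry."""
    kh, kw = kernel.shape
    ih, iw = inp.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = inp[i * stride:i * stride + kh,
                        j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # multiply, then sum
    return out

inp = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
print(cross_correlate2d(inp, kernel))  # 3x3 output: 4x4 input, 2x2 kernel, stride 1
```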
Strictly speaking, convolution requires the kernel to be flipped both horizontally and vertically (i.e., rotated 180°) before sliding. This flip is what gives convolution its commutative property. Deep learning frameworks and neural network implementations skip it, because the kernel weights are learned rather than interpreted by a human, and the convolution is combined with other operations that are not commutative anyway. The mathematical term for this unflipped operation is "cross-correlation". See: https://www.deeplearningbook.org/contents/convnets.html
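You can check this relationship with SciPy, whose `convolve2d` flips the kernel while `correlate2d` does not:

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

inp = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])

# True convolution flips the kernel in both axes; cross-correlation doesn't.
conv = convolve2d(inp, kernel, mode='valid')
corr = correlate2d(inp, kernel, mode='valid')

# Correlating with the 180-degree-rotated kernel reproduces the convolution.
flipped = np.flip(kernel, axis=(0, 1))
assert np.allclose(conv, correlate2d(inp, flipped, mode='valid'))
```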
Theano performs the proper convolution by flipping the kernel, while TensorFlow does not, so converting weights between the two requires flipping the kernels.
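As a sketch, that conversion amounts to flipping the two spatial axes of the weight tensor (assuming a hypothetical height × width × in-channels × out-channels layout; the actual layout varies by framework and version):

```python
import numpy as np

# Hypothetical conv weights in (height, width, in_channels, out_channels) layout.
weights = np.random.randn(3, 3, 16, 32)

# Flipping the spatial axes converts between the flipped-kernel (convolution)
# and unflipped (cross-correlation) conventions.
converted = np.flip(weights, axis=(0, 1))
```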