Seeing the Arrow of Time

Introduction

Given a video, can a machine learning system detect the arrow of time, that is, tell whether the video is running forwards or backwards?
Link to the paper

Youtube Dataset
- 180 short videos (6-10 seconds each) were manually selected using keywords like "dance", "steam train", etc.
- 155 forward videos and 25 reverse videos, so the dataset is highly imbalanced.
Tennis Ball Dataset
- 13 HD videos were recorded of tennis balls rolling and colliding on the floor.

Spatial-temporal Oriented Energy (SOE) is used as an off-the-shelf feature extractor.
Each video is split into 2x2 spatial subregions, and the SOE features of the subregions are concatenated to obtain the final features.
These final features are fed to a linear SVM classifier, and accuracy varies from 48% to 60%.
One reason for poor performance could be the difficulty in generalising motion over different sub-regions.
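
The split-and-concatenate step can be sketched as follows. This is only an illustration of the feature layout: `region_feature` is a hypothetical placeholder (mean signed and absolute frame difference) and is not SOE, and all function names are assumptions made for this sketch.

```python
def split_2x2(frame):
    """Split one frame (a list of pixel rows) into four quadrants, row-major."""
    h, w = len(frame), len(frame[0])
    hm, wm = h // 2, w // 2
    row_ranges = [(0, hm), (hm, h)]
    col_ranges = [(0, wm), (wm, w)]
    return [[row[c0:c1] for row in frame[r0:r1]]
            for r0, r1 in row_ranges for c0, c1 in col_ranges]


def region_feature(prev_region, next_region):
    """Placeholder descriptor (NOT SOE): [mean signed change, mean absolute change]."""
    diffs = [b - a
             for row_a, row_b in zip(prev_region, next_region)
             for a, b in zip(row_a, row_b)]
    n = len(diffs)
    return [sum(diffs) / n, sum(abs(d) for d in diffs) / n]


def video_feature(frames):
    """Average each quadrant's descriptor over time, then concatenate the four."""
    quads = [split_2x2(f) for f in frames]
    feature = []
    for q in range(4):
        per_pair = [region_feature(quads[t][q], quads[t + 1][q])
                    for t in range(len(frames) - 1)]
        feature.extend(sum(col) / len(per_pair) for col in zip(*per_pair))
    return feature
```

The resulting vector has four times the length of a single region descriptor, and it is this concatenated vector that would be passed to the linear SVM.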

The idea is to capture local regions of motion in a video and examine which types of motion are good features for detecting the arrow of time.
Flow words are object-motion descriptors based on SIFT-like descriptors; they capture motion occurring in small patches of the video.
These motion descriptors are quantized to obtain a discrete set of flow words.
The entire video sequence can then be encoded as a bag of flow words, which serves as the feature vector for the learning system.
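
A minimal sketch of the quantization and bag-of-words encoding is given below. It assumes a codebook is already available (for example from clustering training descriptors); the names `nearest_word`, `bag_of_flow_words`, and `codebook`, and the 2-D toy descriptors, are hypothetical and not taken from the paper.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for i, word in enumerate(codebook):
        dist = sum((x - y) ** 2 for x, y in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def bag_of_flow_words(descriptors, codebook):
    """Normalised histogram of flow-word occurrences for one video."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy usage: three flow words, four local motion descriptors.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
descriptors = [[0.9, 0.1], [0.8, 0.0], [0.1, 1.2], [-0.7, 0.2]]
histogram = bag_of_flow_words(descriptors, codebook)
```

The histogram discards where and when each flow word occurred, keeping only how often it occurred in the video.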

Training
- For each video, 4 descriptor histograms were extracted:
  - (A): the native direction of the video
  - (B): this video mirrored in the left-right direction
  - (C): the original video time-flipped
  - (D): the time-flipped left-right-mirrored version
- Train an SVM on the 4 histograms and combine the scores as A + B - C - D, expecting a positive result for forward clips and a negative result for backward clips.
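
The score combination can be sketched as follows, assuming a linear SVM whose decision function is a dot product plus a bias. The names `svm_score`, `arrow_of_time_score`, `weights`, and `bias`, and the toy numbers, are hypothetical.

```python
def svm_score(hist, weights, bias):
    """Decision value of a linear SVM (assumed form: w . x + b)."""
    return sum(w * h for w, h in zip(weights, hist)) + bias


def arrow_of_time_score(hists, weights, bias):
    """Combine the four versions of one clip as A + B - C - D."""
    s = {key: svm_score(hists[key], weights, bias) for key in "ABCD"}
    return s["A"] + s["B"] - s["C"] - s["D"]


def classify(hists, weights, bias):
    """Positive combined score means the clip is judged to run forwards."""
    return "forward" if arrow_of_time_score(hists, weights, bias) > 0 else "backward"


# Toy usage with 2-bin histograms for the four versions of one clip.
hists = {
    "A": [0.7, 0.3],  # native direction
    "B": [0.6, 0.4],  # left-right mirrored
    "C": [0.3, 0.7],  # time-flipped
    "D": [0.4, 0.6],  # time-flipped and mirrored
}
label = classify(hists, weights=[1.0, -1.0], bias=0.0)
```

Because two scores are added and two are subtracted, the bias term cancels in the combined score, so the decision depends only on how the weights respond to the four histograms.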
Result
- Performance varies from 75% to 90%

The idea is to capture motions that cause other motions: it is more common for one motion to cause multiple motions than for multiple motions to collapse into one.
The system looks at the regions in the video from frame to frame with the expectation that, in the forwards-time direction, there would be more occurrences of one region splitting in two than of two regions joining to become one.
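
The split-versus-merge comparison can be illustrated with a simplified sketch. It assumes moving regions have already been extracted per frame and are represented as sets of pixel indices; the overlap rule and the names `count_splits_and_merges` and `guess_direction` are assumptions for this sketch, not the paper's exact procedure.

```python
def count_splits_and_merges(regions_per_frame):
    """Count one-to-many (split) and many-to-one (merge) region transitions.

    regions_per_frame: list of frames, each a list of sets of pixel indices.
    """
    splits = merges = 0
    for current, following in zip(regions_per_frame, regions_per_frame[1:]):
        for region in current:
            # A split: one region overlaps two or more regions in the next frame.
            if sum(1 for nxt in following if region & nxt) >= 2:
                splits += 1
        for nxt in following:
            # A merge: two or more regions overlap a single region in the next frame.
            if sum(1 for region in current if region & nxt) >= 2:
                merges += 1
    return splits, merges


def guess_direction(regions_per_frame):
    """More splits than merges suggests forward time; a tie is left undecided."""
    splits, merges = count_splits_and_merges(regions_per_frame)
    if splits > merges:
        return "forward"
    if merges > splits:
        return "backward"
    return "undecided"


# Toy usage: one moving region breaks into two and stays that way.
clip = [[{1, 2, 3, 4}], [{1, 2}, {3, 4}], [{1, 2}, {3, 4}]]
```

Playing the same toy clip in reverse swaps the two counts, which is what makes the statistic usable as a direction cue.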

Result
- Performance varies from 70% to 73%.
- Though it underperforms the flow-word method, it can complement it: motion causation considers the spatial location of motions, while the flow-word method considers the motion in each frame separately.

The idea is to model the problem as one of inferring causal direction in cause-effect models.
The assumption is that some image motions can be modelled as autoregressive (AR) models with additive non-Gaussian noise.
In such a scenario, the noise added at a given time is independent of past values of the time series but not of future values.
This allows independence tests to be performed for determining the direction of time.
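
The asymmetry can be demonstrated on a synthetic series. The sketch below assumes an AR(1) process with uniform (non-Gaussian) noise and replaces a formal independence test with a crude higher-order correlation between the residuals and the cubed regressor; that statistic, the AR order, and all function names are assumptions made for illustration.

```python
import random


def simulate_ar1(n, coeff, seed):
    """Synthetic AR(1) series x[t] = coeff * x[t-1] + uniform noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(coeff * x[-1] + rng.uniform(-1.0, 1.0))
    return x


def fit_residuals(series):
    """Least-squares AR(1) fit without intercept; returns (regressors, residuals)."""
    past, present = series[:-1], series[1:]
    a = sum(p * q for p, q in zip(past, present)) / sum(p * p for p in past)
    return past, [q - a * p for p, q in zip(past, present)]


def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def dependence(series):
    """Crude stand-in for an independence test: |corr(residual, regressor^3)|."""
    past, residuals = fit_residuals(series)
    return abs(correlation(residuals, [p ** 3 for p in past]))


def guess_direction(series):
    """The direction whose residuals look less dependent on the past is 'forward'."""
    return "forward" if dependence(series) < dependence(series[::-1]) else "backward"


series = simulate_ar1(20000, 0.5, seed=0)
direction = guess_direction(series)
```

With Gaussian noise the residuals would be independent of the regressor in both directions, so no direction could be identified this way; the non-Gaussian assumption is what makes the comparison informative.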

Result
- There is a tradeoff between the accuracy of the system and the number of videos it can classify, depending on the threshold delta applied to the p-values of the independence test.

The paper poses a new and interesting research problem but uses a very small dataset, which in my opinion makes the results inconclusive.