- Given a video, can a machine learning system detect the arrow of time, that is, tell whether the video is running forwards or backwards?
- Link to the paper
-
Youtube Dataset
- 180 short videos (6-10 seconds each) were manually selected using keywords like "dance", "steam trains" etc.
- 155 forward videos and 25 reverse videos, so the dataset is highly imbalanced.
-
Tennis Ball Dataset
- 13 HD videos were recorded of tennis balls rolling and colliding on the floor.
- Spatio-temporal Oriented Energy (SOE) is used as an off-the-shelf feature extractor.
- Split videos into 2x2 spatial subregions and concatenate SOE features to obtain final features.
- These final features are fed to a linear SVM classifier; accuracy varies from 48% to 60%.
- One reason for the poor performance could be the difficulty of generalising motion across different subregions.
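The 2x2 split-and-concatenate step can be sketched as follows. This is only an illustration under stated assumptions: `toy_region_feature` is a hypothetical stand-in (mean signed and mean absolute temporal difference) and is not SOE, and frames are assumed to be plain 2-D lists of grey values.

```python
def split_2x2(frame):
    """Split a 2-D frame (list of rows) into four quadrants: TL, TR, BL, BR."""
    mid_h, mid_w = len(frame) // 2, len(frame[0]) // 2
    return [
        [row[:mid_w] for row in frame[:mid_h]],
        [row[mid_w:] for row in frame[:mid_h]],
        [row[:mid_w] for row in frame[mid_h:]],
        [row[mid_w:] for row in frame[mid_h:]],
    ]


def toy_region_feature(region_frames):
    """Hypothetical stand-in for SOE (NOT the real descriptor):
    mean signed and mean absolute frame-to-frame intensity difference."""
    diffs = [
        cur_px - prev_px
        for prev, cur in zip(region_frames, region_frames[1:])
        for prev_row, cur_row in zip(prev, cur)
        for prev_px, cur_px in zip(prev_row, cur_row)
    ]
    if not diffs:
        return [0.0, 0.0]
    n = len(diffs)
    return [sum(diffs) / n, sum(abs(d) for d in diffs) / n]


def video_features(frames, region_feature=toy_region_feature):
    """Compute one feature vector per quadrant and concatenate them."""
    per_frame = [split_2x2(frame) for frame in frames]
    features = []
    for q in range(4):
        region_frames = [quadrants[q] for quadrants in per_frame]
        features.extend(region_feature(region_frames))
    return features
```

The concatenated vector would then be passed to a linear classifier. Because each quadrant occupies fixed positions in the vector, motion seen in one quadrant during training does not transfer to another, which matches the generalisation problem noted above.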
-
The idea is to capture local regions of motion in a video and examine which types of motion are good features for detecting the arrow of time.
-
Flow words are object-motion descriptors, based on SIFT-like descriptors, that capture the motion occurring in small patches of a video.
-
These motion descriptors are quantized to obtain a discrete set of flow words.
-
The entire video sequence can then be encoded as a bag of flow words, which serves as the feature vector for the learning system.
- For each video, 4 descriptor histograms were extracted:
- (A): the native direction of the video
- (B): this video mirrored in the left-right direction
- (C): the original video time-flipped
- (D): the time-flipped left-right-mirrored version
- Train an SVM using the 4 histograms and combine their scores as A + B - C - D, expecting a positive result for forward clips and a negative result for backward clips.
- Accuracy varies from 75% to 90%.
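A minimal sketch of the bag-of-flow-words encoding and the A + B - C - D score combination is given below. Everything here is an assumption made for illustration: the codebook, the linear `weights`, and the function names are hypothetical, and nearest-codeword assignment is just one common way to quantize descriptors.

```python
from collections import Counter


def nearest_word(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((d - c) ** 2 for d, c in zip(descriptor, codebook[i])),
    )


def flow_word_histogram(descriptors, codebook):
    """Normalised histogram of flow words for one version of a clip."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = len(descriptors) or 1
    return [counts[i] / total for i in range(len(codebook))]


def linear_score(histogram, weights, bias=0.0):
    """Decision value of a (hypothetical, already trained) linear classifier."""
    return sum(h * w for h, w in zip(histogram, weights)) + bias


def arrow_of_time(hist_a, hist_b, hist_c, hist_d, weights, bias=0.0):
    """Combine the four scores as A + B - C - D; positive means forward."""
    a, b, c, d = (
        linear_score(h, weights, bias) for h in (hist_a, hist_b, hist_c, hist_d)
    )
    combined = a + b - c - d
    return combined, ("forward" if combined > 0 else "backward")
```

Subtracting the scores of the time-flipped versions (C and D) means a clip is judged against its own reversal, and adding the mirrored version (B) averages out any left-right bias in the classifier.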
-
The idea is to capture motions that cause other motions: it is more common for one motion to cause multiple motions than for multiple motions to collapse into one.
-
The system looks at the regions in the video from frame to frame with the expectation that, in the forwards-time direction, there would be more occurrences of one region splitting in two than of two regions joining to become one.
- Performance varies from 70% to 73%.
- Though it underperforms the flow-word method, it can complement it: the motion-causation method considers the spatial location of motions, while the flow-word method considers the motion in each frame separately.
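The split-versus-merge counting can be illustrated with a simplified sketch. It assumes the moving regions of each frame are already available as sets of pixel coordinates, and it treats "overlaps two or more regions in the next frame" as a split; both are assumptions made here, not details taken from the paper.

```python
def count_multi_overlaps(src_regions, dst_regions):
    """Count regions in src that overlap two or more regions in dst."""
    return sum(
        1
        for src in src_regions
        if sum(1 for dst in dst_regions if src & dst) >= 2
    )


def split_merge_counts(region_sequence):
    """Total splits (one region -> several) and merges (several -> one)."""
    splits = merges = 0
    for prev, cur in zip(region_sequence, region_sequence[1:]):
        splits += count_multi_overlaps(prev, cur)
        merges += count_multi_overlaps(cur, prev)
    return splits, merges


def guess_direction(region_sequence):
    """Forward if splits outnumber merges, backward if the reverse holds."""
    splits, merges = split_merge_counts(region_sequence)
    if splits == merges:
        return "undecided"
    return "forward" if splits > merges else "backward"
```

For example, `guess_direction([[{0, 1, 2, 3}], [{0, 1}, {2, 3}]])` sees one region splitting into two and no merges, so it votes for the forward direction; the reversed sequence gives the opposite vote.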
-
The idea is to model the problem as one of inferring the causal direction in cause-effect models.
-
The assumption is that some image motions can be modelled as autoregressive (AR) models with additive non-Gaussian noise.
-
In such a scenario, the noise added at some point in time is independent of the past values of the time series but not of its future values.
-
This allows independence tests to be performed for determining the direction of time.
- There is a trade-off between the accuracy achieved by the system and the number of videos it is able to classify (depending on the value of delta used in the p-value test).
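The asymmetry can be demonstrated on a synthetic series with a rough sketch. The assumptions are loud ones: an AR(1) model with uniform noise, and the absolute correlation between squared residuals and squared regressors as a crude dependence score. That score is a stand-in chosen for brevity, not the independence test used in the paper, and `delta` here is a hypothetical margin on the score rather than a p-value threshold.

```python
import random


def fit_ar1_residuals(series):
    """Least-squares AR(1) fit on the mean-centred series.
    Returns the regressors (previous values) and the residuals."""
    mean = sum(series) / len(series)
    x = [v - mean for v in series]
    past, present = x[:-1], x[1:]
    coef = sum(p * q for p, q in zip(past, present)) / sum(p * p for p in past)
    return past, [q - coef * p for p, q in zip(past, present)]


def pearson(u, v):
    """Plain Pearson correlation coefficient."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def dependence(series):
    """Crude dependence score between residuals and regressors (assumption)."""
    past, residuals = fit_ar1_residuals(series)
    return abs(pearson([p * p for p in past], [r * r for r in residuals]))


def infer_direction(series, delta=0.05):
    """Pick the direction whose residuals look more independent;
    abstain when the two scores differ by less than delta."""
    forward_dep = dependence(series)
    backward_dep = dependence(series[::-1])
    if abs(forward_dep - backward_dep) < delta:
        return "undecided"
    return "forward" if forward_dep < backward_dep else "backward"


def simulate_ar1(n, coef=0.5, seed=0):
    """AR(1) series driven by uniform (non-Gaussian) noise."""
    rng = random.Random(seed)
    values, value = [], 0.0
    for _ in range(n):
        value = coef * value + rng.uniform(-1.0, 1.0)
        values.append(value)
    return values
```

Raising `delta` makes the sketch abstain more often but trust its decisions more, which mirrors the accuracy-versus-coverage trade-off. With Gaussian noise the two directions become indistinguishable to this score, which is why the non-Gaussian assumption matters.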
- The paper poses a new and interesting research problem but uses a very small dataset, which makes the results inconclusive in my opinion.