@SqrtRyan
Created January 22, 2026 19:22
Dear Authors of Submission #45242,
Thanks to the efforts of all reviewers, emergency reviewers, ACs, and SACs, preliminary reviews are now available. Below you will find the preliminary reviews for your CVPR 2026 submission, MotionV2V: Editing Motion in a Video (#45242). Authors have the opportunity to submit a rebuttal by January 29, 2026 11:59 PM AoE. Please review the rest of this email, and the Author Guidelines for additional details on the rebuttal process.
We refer to the current reviews as preliminary because review revisions to improve clarity are possible for reviewers until January 23 11:59pm AoE.
Reviews
You can access the reviews for your submission by clicking on the corresponding submission in OpenReview: https://openreview.net/group?id=thecvf.com/CVPR/2026/Conference/Authors
A small number of papers received more than three reviews. If your paper is one of these, it is simply due to the Area Chair’s strenuous efforts in securing emergency reviewers, and does not indicate anything special about your paper.
Rebuttal
If you wish to respond to the reviews, you may upload a one-page PDF document using the template that can be found in the CVPR Author Kit (rebuttal.tex). The entire rebuttal must fit on one page, including any references. To upload your rebuttal, click the “Rebuttal” button on the page of the paper, upload the PDF, and submit the form.
Rebuttals longer than one page will not be reviewed. The same goes for rebuttals where the margins and formatting are deemed to have been altered from the template.
The rebuttal also CANNOT include external links to videos, code repositories, etc (anonymous or otherwise).
The rebuttal must maintain anonymity.
The goal of the rebuttal is to refute any factual errors in reviews or to supply additional information or clarifications requested by the reviewers. Rebuttals may include minor additional experiments or analysis requested by reviewers. They may also include figures, graphs or proofs to better illustrate your arguments. Rebuttals MUST NOT add new contributions (theorems, algorithms, experiments) that were absent in the original submission and were not specifically requested by the reviewers. We allow reviewers to ask for small experiments that could be reasonably run within the rebuttal phase with academic resources. If reviewers ask for experiments that require too much resource or time to run, authors have the right to decline those. Authors should still refrain from including significant new experiment results in the rebuttal not specifically requested by the reviewers.
Confidential Comment to AC
In addition to the PDF, the rebuttal submission form includes an optional box for authors to write a confidential comment to the Area Chair, to address any significant concerns related to policies, ethics, etc. Please do so only in exceptional circumstances. Note that overly lengthy or charged comments are unlikely to benefit the outcome of your paper.
Withdrawal
If you wish to withdraw your paper after reading the reviews, you may do so by pressing the “Withdraw” button in the paper’s page, confirming the withdrawal, and submitting the form.
Best Regards,
CVPR 2026 Program Chairs
Angela Dai, Adriana Kovashka, Chen Change Loy, Vladimir Pavlovic, Alex Schwing, Shaoting Zhang
Reviewer XU4u
Paper Summary
MotionV2V introduces a framework for precise motion control in existing videos by directly manipulating sparse trajectories extracted from the input. Unlike prior methods that often rely on a single initial frame, this approach conditions on the full video sequence, enabling the modification of object and camera motion while rigorously preserving the original scene content.
Paper Strengths
The paper’s primary strength lies in the task it solves: its novel formulation of precise motion editing as a Video-to-Video (V2V) task addresses a significant gap in the existing literature, where previous methods were largely restricted to image-to-video animation or local appearance-based editing.
A major contribution is the introduction of the "motion counterfactual" training dataset generation pipeline, which effectively generates paired datasets sharing identical visual content but distinct motion patterns to fine-tune the diffusion model.
Although some of the major design choices, such as the point-track representation and the ControlNet-like architecture, lack novelty (e.g., the point-track representation reads as something of a mixture of the one used in MotionPrompting and the one in Diffusion-as-Shader), the choices are reasonable, and thanks to this simple yet effective design, MotionV2V significantly outperforms existing solutions.
Major Weaknesses
I am not convinced that the 'Motion Counterfactual Video Generation' produces ideal training pairs. Given two objects (A and B) with distinct motions in a source video, an ideal target video should keep object A's motion identical to the source while giving object B a different motion. How does video frame interpolation ensure this? Wouldn't frame interpolation result in both object A and object B moving differently from the source video?
The paper needs to provide more implementation details. Is the control-branch patchifier randomly initialised? Which model is used for video frame interpolation?
Can the authors show examples with more complex camera control, beyond zoom-out plus object motion control? Moreover, are the existing output videos in the supplementary the result of iterative editing or of editing in one go?
Why are the tracking points sampled uniformly across the frame rather than randomly? Moreover, shouldn't the tracking points (in the training dataset) be focused more on objects?
For the user study and the quantitative evaluation, are two independent test datasets used? Also, what is the source of the 'random internet videos' in the test dataset?
Lastly, the writing of the manuscript should be substantially improved. As a few examples: verb tense is inconsistent (switching between past and present). Shouldn't L336-L343 be moved to a separate discussion section, for example in the conclusion? Moreover, 'We hypothesize that transformer blocks do non-trivial work to achieve this capability.' is a very open-ended and vague statement.
Minor Weaknesses
Are the ‘Frame Interpolation’ and ‘Temporal Resampling’ techniques used jointly in some cases? If not, why is this not considered?
Can the authors show examples where the model fails to follow all correspondences when given too many tracking points?
Preliminary Recommendation
4: Borderline Accept
Justification For Recommendation And Suggestions For Rebuttal
Please address the weaknesses raised above. Improving the writing of the manuscript in general would add further value to the work.
Confidence Level
5: Expert - The reviewer is an authority in the specific subfield addressed by the paper. They are extremely confident in their evaluation and understanding of the work.
Reviewer KbGS
Paper Summary
The paper addresses an underexplored problem in video editing, i.e., video-to-video editing under point-trajectory control. While much of the prior focus has been on generating videos from a single image conditioned on point-trajectory inputs, or on V2V tasks such as stylization, object removal, and colorization, this work focuses on recreating the motion of specific objects in a video while keeping the appearance of all objects intact. The applications of a solution to a problem cast this way are undeniable, ranging from reposing characters to changing camera motion.
The key contribution is the idea of creating motion-counterfactual videos derived from the input videos and using them to train the video-to-video network. The counterfactual videos are created from the dataset of original videos by sampling two end frames and generating everything in between with a state-of-the-art video generative model, followed by tracking points starting from points on the end frames. A small number of qualitative results are shown for a variety of applications that can be cast as changes in point trajectories, such as the main character performing an altogether different action than in the original video, making the camera static, and zooming in or out.
The quantitative results are underwhelming: comparisons are made only against I2V methods, and on metrics that generally do not reflect the true performance of the methods.
Paper Strengths
The main strength of the paper is the introduction of motion counterfactuals derived from the original videos themselves. These can potentially act as a powerful source of training data, differing from the original videos only in the motion of the points of interest while retaining the same appearance and motion in other regions.
The work can unlock several applications that can be cast as changes to point-trajectory information, and hence can be very powerful.
The qualitative results in the supplementary are quite impressive and appealing.
Major Weaknesses
The paper lacks a thorough, systematic evaluation of the different scenarios and motion changes. While comparisons with I2V methods such as ATI and ReVideo serve as necessary indicators for the proposed approach, they are not sufficient, for the obvious reason that the input to the proposed approach carries appearance information for all frames whereas I2V methods receive just a single frame. It is a bit surprising that beating I2V results is highlighted in the paper as outperforming the baselines. I would have liked to see the authors put more effort into baselines such as Drag-A-Video, or consider simpler alternative video-to-video approaches as baselines. Moreover, the metrics used are L2, SSIM, and LPIPS, which hardly capture the fine-grained motion changes that need to be tested for the problem of interest.
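(As an illustration of the kind of metric asked for here, trajectory fidelity is often reported as the mean endpoint error between the requested tracks and tracks re-estimated from the output video. The sketch below is only a minimal example under that assumption; the function name and interface are hypothetical and not from the paper, and both track sets are assumed to be available as NumPy arrays of shape (T, N, 2) with an optional visibility mask.)

    import numpy as np

    def mean_endpoint_error(target_tracks, output_tracks, visibility=None):
        # target_tracks, output_tracks: (T, N, 2) arrays of (x, y) positions
        # for N points over T frames. visibility: optional (T, N) boolean mask
        # marking points visible in both videos; occluded points are skipped.
        err = np.linalg.norm(target_tracks - output_tracks, axis=-1)  # (T, N)
        if visibility is not None:
            err = err[visibility]
        return float(err.mean())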
The paper is poorly written, and the introduction lacks a solid motivation and positioning. Pointing out the failure cases of I2V methods is a weak motivation for working on a V2V problem. A much stronger motivation would have been to start from the applications of V2V motion editing, then adapt existing V2V methods as baselines for this problem and explain why they fail, before moving to the reasoning behind the main contribution of the paper, the motion counterfactuals. The contribution of the work is barely discussed in the introduction. The paper does not get to the "point" soon enough and leaves the reader searching for what is being done in the work and why. For example, the supplementary video was far more informative about the method and easier to understand than the paper itself.
Limitations: There is no limitations section. It is important to show, qualitatively, a set of scenarios where the method might fail. A stress test ranging from inputs where the approach works best to inputs where it fails badly would give the reader a comprehensive picture of the method that can be tied back to the contribution itself.
Minor Weaknesses
Figure 1: The arrow depiction is a bit confusing and cumbersome. I recommend showing the figure on its own to people not involved in this work and using their feedback to improve it.
Preliminary Recommendation
2: Weak Reject
Justification For Recommendation And Suggestions For Rebuttal
While the work has significant promise for V2V applications that involve motion control, and the idea of motion-counterfactual videos is very interesting, it falls short of acceptance at a top-tier conference due to the poor writing and positioning of the paper and the very preliminary evaluation. I would like the authors to address the following:
Comparisons with a strong V2V baseline, even one not originally meant for motion control, as long as it can be adapted.
Stronger evaluation metrics that directly measure the quality of trajectory conditioning.
Confidence Level
4: High Confidence - The reviewer has strong expertise in the area. They are highly familiar with the relevant literature and can critically evaluate the paper.
Reviewer RgoT
Paper Summary
MotionV2V introduces a method for controlled motion editing in videos using user-specified point trajectories. While recent work has studied motion controllability (including using point trajectories), much of it targets generating new videos from scratch or only conditions on the first frame, which limits editing of an existing clip over its full duration.
To address this, MotionV2V adapts a pretrained video diffusion model (CogVideoX) to support fine-grained motion control across the entire video by conditioning on edited point tracks. Specifically, they add a learnable ControlNet-style branch that takes the input video to be edited and a video representation of the point trajectories. For training, they construct paired data from real videos and motion-altered "counterfactuals" via frame interpolation, using the resulting track differences as supervision for motion edits.
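(For concreteness, the data construction described above can be pictured roughly as follows. This is only a hedged sketch of the idea, not the authors' implementation; make_counterfactual_pair, interpolate_video, track_points, and query_points are hypothetical stand-ins for whatever interpolation model and point tracker the paper actually uses.)

    def make_counterfactual_pair(video, interpolate_video, track_points, query_points):
        # video: real clip, array of shape (T, H, W, 3).
        # interpolate_video: hypothetical model that hallucinates T frames between
        #   a given start and end frame (the "motion counterfactual").
        # track_points: hypothetical point tracker returning (T, N, 2) tracks for
        #   N query points in a video.
        T = video.shape[0]
        # Counterfactual clip: same endpoints, different in-between motion.
        counterfactual = interpolate_video(video[0], video[-1], num_frames=T)
        # Tracks in both clips; their difference defines the motion edit to supervise.
        source_tracks = track_points(video, query_points)           # (T, N, 2)
        target_tracks = track_points(counterfactual, query_points)  # (T, N, 2)
        return counterfactual, source_tracks, target_tracks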
Paper Strengths
Unlike first-frame image-to-video motion control methods (e.g. Go with the Flow, ATI), this paper conditions on the entire input video, so it can preserve and manipulate content that only appears mid-sequence. This is a clear limitation of previous approaches and is a good contribution to the field.
The four-way user study shows strong results against recent competitors, which is a good indicator that their core claim is supported: conditioning on the full video and enabling control throughout the sequence matters for motion editing while preserving the original clip.
The control signal is simple and interpretable from a user perspective.
Major Weaknesses
While the quantitative metrics do a good job of measuring frame-level reconstruction and show promising results, they miss standard video criteria such as temporal consistency. Some supplementary examples show unnatural motion and temporal artifacts, so reporting a temporal metric would clarify whether this is a limitation of the approach, or whether it is comparable across methods and a limitation of the base model itself.
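(One common temporal metric that could be reported here is flow-warping error: warp each frame onto the next using estimated optical flow and average the pixel error. The sketch below is only a minimal example, assuming frames in [0, 1] and an external estimate_flow(frame_a, frame_b) helper, which is hypothetical and not part of the paper.)

    import numpy as np

    def warping_error(frames, estimate_flow):
        # frames: (T, H, W, 3) float array in [0, 1].
        # estimate_flow: assumed helper returning an (H, W, 2) backward flow field
        #   that maps each pixel of frame t+1 to its source location in frame t.
        T, H, W, _ = frames.shape
        ys, xs = np.mgrid[0:H, 0:W]
        errors = []
        for t in range(T - 1):
            flow = estimate_flow(frames[t], frames[t + 1])
            # Nearest-neighbour warp of frame t toward frame t+1.
            src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, W - 1)
            src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, H - 1)
            warped = frames[t][src_y, src_x]
            errors.append(np.abs(warped - frames[t + 1]).mean())
        return float(np.mean(errors))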
The paper would be strengthened by an ablation that sweeps the number of tracking points at inference, since the authors note sensitivity to "too many points"; this would make the controllability-versus-quality tradeoff much clearer.
Minor Weaknesses
How long does it take to edit a video on average, and how many timesteps are used?
It seems that videos for the competitor methods were missing from the supplementary; could frames for a couple of examples be included in the rebuttal?
Preliminary Recommendation
5: Weak Accept
Justification For Recommendation And Suggestions For Rebuttal
Overall, this paper identifies and addresses an existing gap in motion controllability research with an approach that is actually designed for video-to-video motion editing, not just first-frame conditioning or full generation. The approach is practical and the user study results support their main claim that control throughout the clip improves edit quality. In the rebuttal, I would like to see a small temporal consistency evaluation, clearer qualitative or quantitative evidence on how performance changes with the number of tracking points, and responses to the things mentioned in minor weaknesses.
Confidence Level
3: Moderate Confidence - The reviewer is reasonably knowledgeable about the topic. They understand the paper's methodology and results but may not be a leading expert in the specific subfield.
Please note that responding to this email will direct your reply to [email protected].