Force Prompting: Video Generation Models Can
Learn and Generalize Physics-based Control Signals

1Brown University, 2Google DeepMind

1. Train a Force-Conditioned Video Model
with Limited Synthetic Data

Local Force Model (Poke)



Global Force Model (Wind)

2. Video Model Generalizes Force Conditioning

Generalizes to Different Settings and Materials


Generalizes to Different Objects and Geometries


Generalizes to Different Affordances


Hints at Mass Understanding


Overview

We investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric.

The main challenge of force prompting is the difficulty of obtaining high-quality paired force-video training data. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized in Blender, even with limited demonstrations of only a few objects (e.g., flying flags, rolling balls). Our method can generate videos that simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations on the training data that reveal two key elements: visual diversity and the use of specific text keywords during training.

In addition, our approach is trained on only around 15k training examples for a single day on four A100 GPUs, making these techniques broadly accessible for future research.



Interacting with Images Using Force Prompts

A user can interact with an image by specifying a force vector (location, angle, magnitude) on the image. Given this force prompt, the video generator then generates the resultant scene. No physics simulator is used at inference time!
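As a concrete illustration, here is a minimal Python sketch of what such a force prompt could look like as a data structure; the ForcePrompt class, its field names, and the model.generate call are hypothetical names for illustration only, not the released interface.

from dataclasses import dataclass

@dataclass
class ForcePrompt:
    # A user-specified force applied to the input image (hypothetical structure).
    x: float          # horizontal location of the force, normalized to [0, 1]
    y: float          # vertical location of the force, normalized to [0, 1]
    angle_deg: float  # direction of the force in the image plane, in degrees
    magnitude: float  # force strength, normalized to [0, 1]

# Hypothetical usage: the conditioning signal is just this small vector;
# no physics simulator is invoked at inference time.
prompt = ForcePrompt(x=0.42, y=0.61, angle_deg=180.0, magnitude=0.7)
# video = model.generate(image=first_frame, force=prompt, text="a plant being poked")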

While the results are not currently real-time or per-frame causal (though they are causal with respect to the conditioning signal), we believe they show the potential of future video generation models as they become faster, more efficient, and more powerful.

Local Force Prompts


Interactive Force Prompting Demos: Try It Yourself! Click on a thumbnail below to select a demo. Then, click on the white bead in the image and drag along the indicated line. Release the mouse to see the generated video!

Global Force Prompts


Interactive Force Prompting Demos: Try It Yourself! Click on a thumbnail below to select a demo. Then, click on the wind icon to select a wind direction and release the mouse to see the generated video!



Training dataset diversity



The global wind force model is trained on 15k synthetic videos of flags in the wind. The model learns how wind should affect the flags and generalizes the wind control signal to diverse types of motion, including tethered and aerodynamic motion, as well as fluid dynamics. Pictured here are three different scenes of flags being blown to the right with varying force magnitudes.



The local point force model is trained on 11k videos of plants being poked and 12k videos of balls being poked. This unified dataset supports modeling of linear motions as well as oscillatory and more complex motions. Pictured here are three different scenes of plants being poked to the left with varying force magnitudes, as well as three scenes of soccer balls being poked upwards with varying force magnitudes.
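One plausible way to feed these two kinds of signals to a video model is to rasterize them into dense conditioning maps. The sketch below is only an assumption for illustration (a Gaussian-weighted splat for the local poke and a constant field for the global wind); it is not necessarily the exact encoding used by the model.

import numpy as np

def local_force_map(h, w, x, y, angle_deg, magnitude, sigma=0.05):
    # Rasterize a point force as a Gaussian-weighted 2-channel vector field.
    # (Assumed encoding for illustration; x and y are normalized to [0, 1].)
    ys, xs = np.mgrid[0:h, 0:w]
    xs, ys = xs / (w - 1), ys / (h - 1)
    weight = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    theta = np.deg2rad(angle_deg)
    fx, fy = magnitude * np.cos(theta), magnitude * np.sin(theta)
    return np.stack([weight * fx, weight * fy], axis=0)  # shape (2, h, w)

def global_wind_map(h, w, angle_deg, magnitude):
    # Rasterize a uniform wind as a constant 2-channel vector field (assumed encoding).
    theta = np.deg2rad(angle_deg)
    fx, fy = magnitude * np.cos(theta), magnitude * np.sin(theta)
    return np.stack([np.full((h, w), fx), np.full((h, w), fy)], axis=0)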







Force Prompting Can Recreate Some Demos for
Prior Works that Use a Physics Simulator at Inference

To demonstrate the point force model's versatility, we curate a benchmark using first-frame images from some prominent physics-in-the-loop papers. We are not claiming that the Force Prompting method outperforms those methods on visual fidelity or physical realism. Rather, we wish to illustrate that our purely neural method can handle some of the same visual scenarios almost as effectively as approaches which require some combination of 3D assets and explicit physics simulation at inference time.

Recreating a PhysDreamer (ECCV 2024) demo

Recreating a DreamPhysics (AAAI 2025) demo

Recreating a MotionCraft (NeurIPS 2024) demo

Recreating a PhysGaussian (CVPR 2024) demo

Recreating a PhysGen (ECCV 2024) demo

Recreating a Physics3D demo

Recreating a PhysMotion demo

Recreating a PhysGen3D demo



Hints at Mass Understanding

The same force results in different motion depending on the object's inferred mass

Single book vs. stack of books

Empty laundry basket vs. full laundry basket

Single cube vs. stack of cubes

Wooden ornament vs. metal ornament

Analysis of Effect of Text Keywords on Generalization

We find that the use of standard keywords (e.g., wind/blow/breeze) at train time is crucial for the wind model's generalization. Interestingly, whether these keywords are present at inference time does not seem to matter significantly. We hypothesize that using them at train time allows the model to connect the conditioning signal with these keywords and the video distributions they represent.

"Wind" keyword is important at train time but not at inference time

Analysis of Effect of Visual Diversity on Generalization

Our main finding is the surprising generalization given limited paired data; however, this generalization still requires strategically selecting certain types of visual diversity. Here we ablate several of these types of diversity and show the effect of removing each. While we find the generalization ability promising, we also believe that more diverse training data will improve the robustness of the model.

Background Diversity

Number of Flags for Wind

Number of Balls for Poke



Limitations


Failure Case #1: The Physics is Out-of-Domain for the Base Video Model

The dust is blown in the prompted direction, but the base video model has difficulty generating a physically plausible person-plow-ground interaction

The kite is blown in the prompted direction, but the base video model has difficulty generating a physically plausible video of a kite dragging a person

The egg rolls in the prompted direction, but the base video model has difficulty rolling non-spherical objects, so the egg appears to float

Failure Case #2: The Base Video Model's Prior Competes with the Force Prompt

The rocking chair moves in the prompted direction, but the base video model has trouble distinguishing between foreground and background objects

The rubber duck moves in the prompted direction but bobs up and down due to the base model's prior. Also, all objects in the scene move because the base model struggles with object atomicity in complex scenes

The confetti moves in the prompted direction, but the base video model conjures extra confetti into the scene



Computational Resources

Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, making these techniques broadly accessible for future research.

BibTeX

@misc{gillman2025forcepromptingvideogeneration,
      title={Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals}, 
      author={Nate Gillman and Charles Herrmann and Michael Freeman and Daksh Aggarwal and Evan Luo and Deqing Sun and Chen Sun},
      year={2025},
      eprint={2505.19386},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.19386}, 
}