PhyPrompt: RL-based Prompt Refinement for Physically Plausible Text-to-Video Generation

PhyPrompt is a novel reinforcement learning framework that automatically refines text prompts so that text-to-video models generate physically plausible videos. The system uses a two-stage approach: a fine-tuned 7B-parameter LLM, followed by Group Relative Policy Optimization (GRPO) with a dynamic reward curriculum. It achieves a 40.8% joint success rate on the VideoPhy2 benchmark, an 8.6 percentage point improvement over baseline methods. Despite its smaller size, PhyPrompt-7B outperforms larger models such as GPT-4o and DeepSeek-V3 on physical commonsense reasoning tasks.

Researchers have developed a novel AI framework, PhyPrompt, that automatically refines text prompts to generate more physically realistic videos, addressing a critical weakness in state-of-the-art text-to-video models. This work shifts the focus from scaling model parameters to optimizing the input prompt itself, demonstrating that specialized, smaller models can outperform vastly larger general-purpose systems on domain-specific tasks like physical commonsense reasoning.

Key Takeaways

  • Current text-to-video (T2V) generators often produce high-quality visuals that violate basic physical laws, a problem traced to insufficient physical detail in user prompts rather than inherent model limitations.
  • PhyPrompt is a two-stage reinforcement learning framework that first fine-tunes a 7B-parameter LLM on physics reasoning, then applies Group Relative Policy Optimization (GRPO) with a novel dynamic reward curriculum to refine prompts.
  • The system achieves a 40.8% joint success rate on the VideoPhy2 benchmark, an 8.6 percentage point gain over the baseline, improving physical commonsense by 11 percentage points and semantic adherence by 4.4 points simultaneously.
  • Despite its small size, PhyPrompt-7B outperforms GPT-4o and the much larger DeepSeek-V3 on this task, and shows zero-shot transferability across multiple T2V architectures like LAVIE and VideoCrafter2.
  • The research establishes that domain-specialized reinforcement learning with compositional curricula can surpass general-purpose model scaling for improving physics-aware generation.

How PhyPrompt Engineers Physical Realism

The core innovation of PhyPrompt lies in its structured approach to prompt engineering, which is treated as a learnable optimization problem. The framework operates in two distinct stages. First, a 7-billion-parameter large language model is fine-tuned on a custom Chain-of-Thought dataset focused on physics principles. This teaches the model to decompose a user's intent, such as "a ball rolls down a hill", into underlying concepts of object motion, force interactions, material properties, and environmental constraints.
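To make that decomposition concrete, here is a minimal Python sketch of how one such Chain-of-Thought record might be represented; the field names (`object_motion`, `force_interactions`, and so on) are illustrative assumptions, not the paper's published dataset schema:

```python
from dataclasses import dataclass, field

@dataclass
class PhysicsDecomposition:
    """One Chain-of-Thought record: a user's intent broken into physics facets.
    Field names are illustrative, not the paper's actual schema."""
    user_prompt: str
    object_motion: list[str] = field(default_factory=list)       # trajectories, rotation
    force_interactions: list[str] = field(default_factory=list)  # gravity, friction, contact
    material_properties: list[str] = field(default_factory=list) # rigidity, mass, elasticity
    environment: list[str] = field(default_factory=list)         # terrain, slope, obstacles

example = PhysicsDecomposition(
    user_prompt="a ball rolls down a hill",
    object_motion=["accelerates along the slope", "rotation consistent with translation"],
    force_interactions=["gravity pulls it downhill", "rolling friction slows it"],
    material_properties=["rigid sphere", "moderate mass"],
    environment=["inclined grassy terrain", "small bumps cause slight bounces"],
)
```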

The second stage employs a reinforcement learning (RL) technique called Group Relative Policy Optimization (GRPO). This is where the system learns to generate the optimal final prompt. The key to its success is a dynamic reward curriculum. Initially, the RL agent is heavily rewarded for maintaining semantic fidelity to the user's original request. As training progresses, the reward function progressively shifts to prioritize physical commonsense and plausibility. This curriculum avoids the common multi-objective trade-off, enabling the model to discover compositional prompts that excel at both tasks simultaneously.
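A minimal sketch of how such a curriculum could work, under stated assumptions: the paper's exact reward functions and schedule are not reproduced here, so the linear anneal and the 0.8/0.2 weights below are placeholders. The group-relative advantage, however, is the defining feature of GRPO: rewards are standardized within a group of sampled candidates, with no separate value network.

```python
import numpy as np

def curriculum_weight(step: int, total_steps: int,
                      w_start: float = 0.8, w_end: float = 0.2) -> float:
    """Weight on semantic fidelity: starts high, annealed toward physics.
    Linear schedule and endpoints are assumptions, not the paper's values."""
    t = min(step / total_steps, 1.0)
    return w_start + (w_end - w_start) * t

def blended_reward(semantic_score: float, physics_score: float,
                   step: int, total_steps: int) -> float:
    """Scalar reward mixing the two objectives per the current curriculum weight."""
    w = curriculum_weight(step, total_steps)
    return w * semantic_score + (1.0 - w) * physics_score

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: each candidate's reward standardized against
    its own group's mean and standard deviation."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy usage: 4 candidate refinements of one prompt, halfway through training.
scores = [(0.9, 0.4), (0.7, 0.8), (0.6, 0.9), (0.8, 0.5)]  # (semantic, physics)
rewards = np.array([blended_reward(s, p, step=500, total_steps=1000) for s, p in scores])
print(grpo_advantages(rewards))
```

Early in training, the heavy weight on semantic fidelity means faithful-but-unphysical refinements still score well; by the end, the flipped weighting reverses that preference, pushing the policy toward prompts that satisfy both criteria.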

The results on the VideoPhy2 benchmark are compelling. PhyPrompt-7B achieves a 40.8% joint success rate, an 8.6 percentage point improvement over the baseline. Crucially, it boosts physical commonsense accuracy from 55.8% to 66.8% (+11 points) while also increasing semantic adherence from 43.4% to 47.8% (+4.4 points). This synergistic improvement shows the framework goes beyond a simple trade-off between the two objectives. Furthermore, the refined prompts show remarkable zero-shot transferability, improving video generation across diverse architectures such as LAVIE, VideoCrafter2, and CogVideoX-5B by up to 16.8%, indicating the solution is model-agnostic.
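For readers parsing the numbers: on VideoPhy-style evaluations, "joint success" typically means a single generated video satisfies both criteria at once (treating each criterion as a binary pass is an assumption here). A quick sketch:

```python
def joint_success_rate(semantic_pass: list[bool], physics_pass: list[bool]) -> float:
    """Fraction of videos that pass BOTH semantic adherence and physical
    commonsense checks. Binary pass/fail per criterion is an assumption."""
    assert len(semantic_pass) == len(physics_pass)
    return sum(s and p for s, p in zip(semantic_pass, physics_pass)) / len(semantic_pass)
```

This definition also implies a consistency check: the joint rate can never exceed the smaller of the two marginal rates, and indeed 40.8% sits below the 47.8% semantic adherence figure.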

Industry Context & Analysis

This research arrives at a pivotal moment in the text-to-video race, where visual quality has seen rapid gains (evidenced by models like OpenAI's Sora, Runway's Gen-2, and Pika Labs) while fundamental reasoning about the physical world remains a glaring weakness. The industry's dominant approach has been to scale data and parameters, exemplified by models like DeepSeek-V3 with its 671 billion total (mixture-of-experts) parameters. PhyPrompt's results challenge this paradigm head-on. By outperforming DeepSeek-V3 on the VideoPhy2 benchmark with a model roughly 100 times smaller, it provides strong evidence that for specific capability gaps like physics, specialized algorithmic innovation can be more effective than brute-force scaling.

The technical implication is significant: the "garbage in, garbage out" principle is acutely relevant for generative video. A model like Sora may have latent physical understanding, but it cannot manifest it without the correct cues. PhyPrompt effectively acts as a dedicated "reasoning co-pilot" for video generation, bridging the gap between user intent and model capability. This follows a broader trend of using smaller, specialized models to guide or correct larger foundational models, seen in areas like code generation (e.g., using a small model for test-case generation to improve a larger model's output).

From a competitive standpoint, PhyPrompt's method is orthogonal to the efforts of major labs. While OpenAI, Google, and Meta invest in building video foundation models with better world understanding from pretraining, this work offers a practical, deployable tool that can augment existing models immediately. Its margin over GPT-4o (+3.8 points in joint success rate) is particularly notable, as it suggests that even the most advanced general-purpose LLMs are not optimally configured for the precise task of physics-aware prompt refinement.

What This Means Going Forward

The immediate beneficiaries of this technology are developers and platforms using current T2V models. Integrating a framework like PhyPrompt could significantly enhance the usability and output quality of tools from Runway, Pika, or Stability AI without retraining the core video models. It effectively democratizes high-quality prompt engineering, until now a specialized skill, making physically coherent video generation accessible to non-expert users.

Looking ahead, this research points to a future where generative AI pipelines become multi-stage. The first stage will involve intent refinement and constraint specification (PhyPrompt's role), followed by the actual media generation. We can expect a proliferation of similar specialized "pre-processor" models for other challenging domains like consistent character generation, accurate procedural knowledge (e.g., cooking steps), or temporal narrative coherence. The success of the dynamic reward curriculum also offers a blueprint for training AI systems on complex, compositional objectives beyond physics.
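A sketch of that multi-stage shape; the function names and placeholder bodies below are hypothetical illustrations, not an API from the paper or any T2V vendor:

```python
def refine_prompt(user_prompt: str) -> str:
    """Stage 1 (PhyPrompt's role): a specialized LLM rewrites the prompt with
    explicit physics constraints. Hypothetical placeholder logic."""
    return f"{user_prompt}, with gravity-consistent motion, realistic friction, and stable contact"

def generate_video(refined_prompt: str) -> bytes:
    """Stage 2: any off-the-shelf T2V model; because the refiner is
    model-agnostic, this stage is swappable. Placeholder returning empty bytes."""
    return b""

video = generate_video(refine_prompt("a ball rolls down a hill"))
```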

A key trend to watch is whether major model providers will internalize this approach. Will OpenAI build a "Sora-specific prompt optimizer," or will a third-party ecosystem of prompt-enhancement tools emerge? Furthermore, the next logical step is to close the loop: using the failures identified by benchmarks like VideoPhy2 to not only refine prompts but also to create targeted training data to fine-tune the video models themselves, leading to a virtuous cycle of improvement. PhyPrompt successfully redirects attention from the model to the message, proving that for achieving true world understanding in AI, how we ask the question is just as critical as the system answering it.
