Cosmos Predict-2: The Foundation for Your Own World Model

Community article, published June 17, 2025

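Building smarter robots and autonomous vehicles (AVs) starts with foundation models that understand real-world dynamics. These models serve two critical roles:

- Accelerating **synthetic data generation (SDG)** to teach machines real-world physics and interactions, including rare edge cases.
- Serving as **base models** that can be post-trained for specialized tasks or adapted to different output types.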

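**Cosmos Predict-1** was built for exactly this purpose: generating realistic, physics-aware future world states.

Now, **Cosmos Predict-2** brings major upgrades in speed, visual quality, and customization.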

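🔭 Introducing Cosmos Predict-2

Cosmos Predict-2 is our **top-performing world foundation model for physical AI**. Its architecture has been optimized for speed and scalability, and it supports flexible resolutions and frame rates across use cases and hardware platforms.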

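Two model variants are available, each optimized for a different level of task complexity:

- Cosmos Predict-2 2B: Fast inference with a low memory footprint. Suited to prototyping, low-latency applications, and edge deployment.
- Cosmos Predict-2 14B: Built for high-fidelity world modeling, complex scene understanding, extended temporal coherence, and prompt precision.

Developers can get started with the video2world model by supplying a reference image of a robot or AV environment and generating consistent, physically accurate videos of world states. The text-to-image model can also create preview images from text prompts.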

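📐 Resolution and Frame Rate Options

Cosmos Predict-2 offers flexible output formats:

Resolution
- 720p
- 480p, for faster throughput
Frame rate
- Available: 10 fps, 16 fps
- Coming soon: 24 fps (ideal for 10 Hz simulation and AV training pipelines)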
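
The inference example later in this article loads a checkpoint named `model-720p-16fps.pt`, which suggests that resolution and frame rate are encoded in the file name. The sketch below builds such a path from the options listed above. It assumes that the other variants follow the same `model-<resolution>-<fps>fps.pt` pattern, which the article does not confirm, and `checkpoint_path` is a hypothetical helper rather than part of the Cosmos codebase.

```python
from pathlib import PurePosixPath

# Options the article lists as available; 24 fps is "coming soon" and excluded.
AVAILABLE_RESOLUTIONS = ("480p", "720p")
AVAILABLE_FPS = (10, 16)
MODEL_SIZES = ("2B", "14B")


def checkpoint_path(model_size: str, resolution: str, fps: int,
                    root: str = "checkpoints/nvidia") -> str:
    """Build a Video2World checkpoint path (hypothetical helper).

    Assumption: every variant follows the naming of the single path
    shown in the article's inference example.
    """
    if model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    if resolution not in AVAILABLE_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if fps not in AVAILABLE_FPS:
        raise ValueError(f"unsupported frame rate: {fps!r}")
    model_dir = f"Cosmos-Predict2-{model_size}-Video2World"
    return str(PurePosixPath(root) / model_dir / f"model-{resolution}-{fps}fps.pt")


print(checkpoint_path("2B", "720p", 16))
```

Validating the combination up front gives a clear error for an option that is not yet released, instead of a missing-file failure at load time.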

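⚙️ Inference and Performance

Cosmos Predict-2 is designed for **fast, flexible inference** across hardware setups.

**2B variant**: Delivers fast performance on NVIDIA GB200 NVL72, DGX B200, RTX PRO 6000, or RTX 6000 Ada.
**14B variant**: Delivers higher fidelity on GB200/B200 systems for complex tasks that demand temporal coherence.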

import torch
from imaginaire.utils.io import save_image_or_video
from cosmos_predict2.configs.base.config_video2world import PREDICT2_VIDEO2WORLD_PIPELINE_2B
from cosmos_predict2.pipelines.video2world import Video2WorldPipeline

# Create the video generation pipeline.
pipe = Video2WorldPipeline.from_config(
    config=PREDICT2_VIDEO2WORLD_PIPELINE_2B,
    dit_path="checkpoints/nvidia/Cosmos-Predict2-2B-Video2World/model-720p-16fps.pt",
    text_encoder_path="checkpoints/google-t5/t5-11b",
)

# Specify the input image path and text prompt.
image_path = "assets/video2world/example_input.jpg"
prompt = "A high-definition video captures the precision of robotic welding in an industrial setting. The first frame showcases a robotic arm, equipped with a welding torch, positioned over a large metal structure. The welding process is in full swing, with bright sparks and intense light illuminating the scene, creating a vivid display of blue and white hues. A significant amount of smoke billows around the welding area, partially obscuring the view but emphasizing the heat and activity. The background reveals parts of the workshop environment, including a ventilation system and various pieces of machinery, indicating a busy and functional industrial workspace. As the video progresses, the robotic arm maintains its steady position, continuing the welding process and moving to its left. The welding torch consistently emits sparks and light, and the smoke continues to rise, diffusing slightly as it moves upward. The metal surface beneath the torch shows ongoing signs of heating and melting. The scene retains its industrial ambiance, with the welding sparks and smoke dominating the visual field, underscoring the ongoing nature of the welding operation."

# Run the video generation pipeline.
video = pipe(input_path=image_path, prompt=prompt)

# Save the resulting output video.
save_image_or_video(video, "output/test.mp4", fps=16)
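
To generate several clips in one run, the single call above can be wrapped in a loop. The following is a minimal sketch that relies only on the two calls shown in the example, `pipe(input_path=..., prompt=...)` and `save_image_or_video(video, path, fps=...)`. The `generate_batch` helper and the job tuple layout are illustrative assumptions, not part of the Cosmos codebase.

```python
from pathlib import Path
from typing import Any, Callable, Iterable, List, Tuple

# Each job is (output name, input image path, text prompt); this layout is an assumption.
Job = Tuple[str, str, str]


def generate_batch(pipe: Callable[..., Any],
                   save_fn: Callable[..., Any],
                   jobs: Iterable[Job],
                   output_dir: str = "output",
                   fps: int = 16) -> List[Path]:
    """Run the pipeline once per job and save each result as <name>.mp4."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, image_path, prompt in jobs:
        # Same keyword arguments as the single-clip example above.
        video = pipe(input_path=image_path, prompt=prompt)
        out_path = out_dir / f"{name}.mp4"
        save_fn(video, str(out_path), fps=fps)
        written.append(out_path)
    return written
```

With the objects from the example above, a call would look like `generate_batch(pipe, save_image_or_video, [("welding", image_path, prompt)])`. Passing the pipeline and the save function as arguments keeps the helper testable without a GPU.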

👉 Full setup instructions: nvidia-cosmos/cosmos-predict2 GitHub repository

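🛠️ Post-Training Cosmos Predict-2 for Your Use Case

Cosmos Predict-2 can be **post-trained** for custom applications such as robotics, autonomous vehicles, and industrial automation. Post-training use cases include: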

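Domain	Hardware-Specific Operation	Example Application
Robotics	Instruction control, object manipulation	Picking apples with varying stem strength
Autonomous vehicles	Multi-view generation, edge-case simulation	Rainy highway driving with lidar/camera synchronization
Industrial	Action-conditioned workflows	Predictive maintenance for conveyor-belt robots
Vision	Camera pose conditioning	Generating 3D-consistent video from a single image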

Suppose you want to post-train the model for the first example in the table above.

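Step 1: Prepare data with open-source data curation tools

- Collect 100+ hours of representative teleoperation data for your task.
- Use Cosmos Curate to process, analyze, and organize the video content.
- You will need accurate text+video pairs for the next step.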
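
Before training, it helps to confirm that every clip really has a caption. The sketch below pairs each video with a same-named text file and reports any video left without one. The directory layout (`.mp4` clips alongside `.txt` captions) and the `build_manifest` helper are assumptions for illustration. They do not describe the output format of Cosmos Curate or the input format the training scripts expect.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def build_manifest(video_dir: str, caption_dir: str) -> Tuple[List[Dict[str, str]], List[str]]:
    """Pair each .mp4 with a same-named .txt caption (hypothetical layout).

    Returns (pairs, missing): pairs holds video/caption records, and missing
    lists video file names whose caption is absent or empty.
    """
    pairs: List[Dict[str, str]] = []
    missing: List[str] = []
    for video in sorted(Path(video_dir).glob("*.mp4")):
        caption_file = Path(caption_dir) / f"{video.stem}.txt"
        caption = ""
        if caption_file.exists():
            caption = caption_file.read_text(encoding="utf-8").strip()
        if caption:
            pairs.append({"video": str(video), "caption": caption})
        else:
            missing.append(video.name)
    return pairs, missing
```

A non-empty `missing` list is a signal to caption or drop those clips before moving on to post-training.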

Step 2: Post-train the model

- Refer to the training scripts.
- Fine-tune on your robot/environment using the curated data.

Step 3: Generate synthetic scenes

- Example prompt: "Pick up a bruised apple in low light"
- You can also condition on images.
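
One simple way to cover more scenarios is to vary the parts of the prompt systematically. The sketch below combines actions, objects, and conditions into prompt strings. The `scene_prompts` helper and all phrases other than the article's example prompt are illustrative assumptions.

```python
from itertools import product
from typing import List, Sequence


def scene_prompts(actions: Sequence[str],
                  objects: Sequence[str],
                  conditions: Sequence[str]) -> List[str]:
    """Return one prompt per (action, object, condition) combination."""
    return [f"{action} {obj} {cond}"
            for action, obj, cond in product(actions, objects, conditions)]


prompts = scene_prompts(
    actions=["Pick up"],
    objects=["a bruised apple", "a ripe apple"],          # second entry is made up
    conditions=["in low light", "in bright sunlight"],    # second entry is made up
)
for p in prompts:
    print(p)
```

The first generated prompt matches the example above, and each of the four prompts can be passed to the post-trained model in turn.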

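Step 4: Validate with Cosmos Reason

Cosmos Reason is a **spatiotemporally aware physical AI reasoning model** that evaluates synthetic video data against training-data quality criteria. Sample evaluation prompts include:

✅ Did the robot grasp the apple correctly?
✅ Are the joint angles within safe limits?
✅ Are there motion artifacts or object collisions?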
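
The checks above can be turned into a filter over generated clips. The sketch below is only an outline of that idea: `ask` is a hypothetical callable that wraps a reasoning model and returns a yes/no answer as a boolean, and it is not the Cosmos Reason API. Note that the third question is phrased so that a good clip answers "no".

```python
from typing import Callable, Iterable, List, Tuple

# (question, answer expected from a clip that should be kept)
CHECKS: Tuple[Tuple[str, bool], ...] = (
    ("Did the robot grasp the apple correctly?", True),
    ("Are the joint angles within safe limits?", True),
    ("Are there motion artifacts or object collisions?", False),
)

# Hypothetical interface: ask(video_path, question) -> True for "yes", False for "no".
AskFn = Callable[[str, str], bool]


def passes_checks(video_path: str, ask: AskFn) -> bool:
    """Return True only if every answer matches the expected one."""
    return all(ask(video_path, question) == expected
               for question, expected in CHECKS)


def filter_videos(video_paths: Iterable[str], ask: AskFn) -> List[str]:
    """Keep the clips that pass all checks, preserving input order."""
    return [path for path in video_paths if passes_checks(path, ask)]
```

Clips that fail any check can be discarded or sent back for regeneration with an adjusted prompt.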

Cosmos Predict-2 post-trained samples:

Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-GR1: future visual world generation from video + text, post-trained on GR00T GR1 data

Cosmos-Predict2-14B-Video2World-Sample-GR00T-Dreams-DROID: future visual world generation from video + text, post-trained on GR00T DROID data

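🧠 Try Cosmos Predict-2 Today

Cosmos Predict-2 is now available on Hugging Face. For detailed setup instructions, see the GitHub repository: nvidia-cosmos/cosmos-predict2

Learn more: NVIDIA Cosmos

Join our community for learning content, livestreams, and discussions: NVIDIA Omniverse Community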
