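Page to Video: Generate videos from web pages 🪄🎬

Community Article, published May 6, 2025

TL;DR: We built an application that converts web pages into educational videos with slides.

The web is full of knowledge, but it is often locked away in articles and pages. Wouldn't it be cool if that content could be turned into engaging video lessons automatically? That is exactly what the experimental `page-to-video` Space does!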

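Origin story

Over the past few months, our team has released courses at a record pace on topics such as LLMs, inference, fine-tuning, reasoning models, and agents. From sharing these courses, we learned that people want more ways to learn and digest information quickly. Reading dense articles is great, but sometimes a video just makes things click, right?

This field moves fast, though, and a well-made recording takes time. Wouldn't it be cool if we could generate videos for ourselves on the fly?

This obviously raises quality concerns. Learning takes time, and we don't want people to learn the wrong things. So we set the following requirements for the application.

Cost-effective, so that people can generate their own videos.
Text-based, so that what is said and what is shown are easy to define.
Source-controlled, so that the community can make changes.

With that in mind, we took a straightforward approach that combines a text transcript, Markdown slides, and text-to-speech to produce a video with audio and slides.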

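So, what is `page-to-video`?

`page-to-video` takes the URL of a web page and returns a video about that page, with narration and slides. Here are the steps:

You supply the URL of a web page

`page-to-video` fetches the HTML from the site and extracts entities such as text, images, and tables so that they can be used in the transcript and slides.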
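
The extraction step can be sketched with Python's standard library alone. This is a minimal illustration, not the Space's actual scraper: the `PageExtractor` class, the `extract_page` helper, the list of skipped tags, and the sample HTML are all hypothetical, and the sketch only collects visible text and image URLs (tables are left out for brevity).

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Hypothetical helper: collects visible text and image URLs from HTML."""

    # Assumption: content inside these tags is boilerplate we do not want.
    SKIPPED_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.image_urls = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "img" and self._skip_depth == 0:
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag
        if self._skip_depth == 0 and data.strip():
            self.text_chunks.append(data.strip())


def extract_page(html):
    """Return (text, image_urls) for an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.text_chunks), parser.image_urls


sample_html = """
<html><body>
  <nav>Home | About</nav>
  <h1>Fine-tuning basics</h1>
  <p>Fine-tuning adapts a pretrained model.</p>
  <img src="https://example.com/diagram.png" alt="diagram">
  <script>console.log("ignored")</script>
</body></html>
"""

text, images = extract_page(sample_html)
print(text)
print(images)
```

The text and the image list are the two inputs the slide prompt later expects, which is why the sketch returns them separately.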

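Generate slides and transcript.

First, the application uses Cohere's `CohereLabs/c4ai-command-a-03-2025` model, served through Inference Providers, to generate a transcript from the web page. This kind of summarization and format change is usually straightforward for LLMs, because it only asks them to present the input text in another format; in this case, a spoken lesson.

With the transcript in hand, command-a generates slides in a Markdown slide format similar to Marp (the prompt shown later targets Remark.js). This lets users edit the slides and transcript as text and keep full control over their video.

💡 Tip: page-to-video also returns a PDF version of the slides, in case you want to start from there!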

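Generate speech from the content

Next, we use the Dia-1.6B model on the Fal ai platform to generate speech from the transcript. This creates a speech clip for each slide, which you can review in the application's interface.

💡 Tip: Minimax can clone voices, so if you duplicate the Space, clone your voice, and add the `VOICE_ID` parameter to the environment variables, you can generate videos in your own voice!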

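Combine into a video

Finally, the application uses ffmpeg to combine the slide images and audio files into a video. This takes about a minute for a 5-minute clip, but it involves no AI inference, so it is free.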
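
As a rough illustration of this step, the sketch below builds an ffmpeg command that shows one still slide image for the duration of one audio clip. The Space's real ffmpeg invocation is not shown in this article, so the flag selection, the helper names `build_ffmpeg_command` and `render_segment`, and the file names are assumptions.

```python
import subprocess


def build_ffmpeg_command(image_path, audio_path, output_path):
    """Build an ffmpeg command that pairs one still image with one audio clip."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-loop", "1",            # repeat the single image as video frames
        "-i", image_path,
        "-i", audio_path,
        "-c:v", "libx264",
        "-tune", "stillimage",   # x264 tuning for static content
        "-c:a", "aac",
        "-pix_fmt", "yuv420p",   # widely supported pixel format
        "-shortest",             # stop when the audio ends (the looped image never does)
        output_path,
    ]


def render_segment(image_path, audio_path, output_path):
    """Run the command; requires the ffmpeg binary on PATH."""
    subprocess.run(
        build_ffmpeg_command(image_path, audio_path, output_path), check=True
    )


# Hypothetical file names, only used to show the resulting command
command = build_ffmpeg_command("slide_01.png", "slide_01.wav", "segment_01.mp4")
print(" ".join(command))
```

One segment per slide could then be joined into the final video, for example with ffmpeg's concat demuxer. Keeping command construction separate from execution makes the command easy to inspect before anything is rendered.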

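How does it work, and how can I build my own?

Here I'll break the application down and highlight the key AI pieces that you can reuse in your own projects.

Slides and transcript

The application is built on Inference Providers, which it uses for its LLM calls.

Rather than building and hosting large AI models for each step (text understanding, image generation, video generation), page-to-video simply uses services that are available through APIs on the Hub.

Slide generation prompt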


LLM_MODEL = "CohereLabs/c4ai-command-a-03-2025"  # Model ID
PRESENTATION_PROMPT_TEMPLATE = """
You are an expert technical writer and presentation creator. Your task is to convert the
following web content into a complete Remark.js presentation file suitable for conversion
to PDF/video.

**Input Web Content:**

{markdown_content}

**Available Images from the Webpage (Use relevant ones appropriately):**

{image_list_str}

**Instructions:**

1.  **Structure:** Create slides based on the logical sections of the input content.
    Use headings or distinct topics as indicators for new slides. Aim for a
    reasonable number of slides (e.g., 5-15 depending on content length).
2.  **Slide Format:** Each slide should start with `# Slide Title`.
3.  **Content:** Include the relevant text and key points from the input content
    within each slide. Keep slide content concise.
4.  **Images & Layout:**
    *   Where appropriate, incorporate relevant images from the 'Available Images'
        list provided above.
    *   Use the `![alt text](url)` markdown syntax for images.
    *   To display text and an image side-by-side, use the following HTML structure
        within the markdown slide content:
        ```markdown
        .col-6[
            {{text}}  # Escaped braces for Python format
        ]
        .col-6[
            ![alt text](url)
        ]
        ```
    *   Ensure the image URL is correct and accessible from the list. Choose images
        that are close to the slide's text content. If no image is relevant,
        just include the text. Only use images from the provided list.
5.  **Presenter Notes (Transcription Style):** For each slide, generate a detailed
    **transcription** of what the presenter should say, explaining the slide's
    content in a natural, flowing manner. Place this transcription after the slide
    content, separated by `???`.
6.  **Speaker Style:** The speaker notes should flow smoothly from one slide to the
    next. No need to explicitly mention the slide number. The notes should
    elaborate on the concise slide content.
7.  **Separators:** Separate individual slides using `\\n\\n---\\n\\n`.
8.  **Cleanup:** Do NOT include any specific HTML tags from the original source webpage
    unless explicitly instructed (like the `.row`/`.col-6` structure for layout).
    Remove boilerplate text, navigation links, ads, etc. Focus on the core content.
9.  **Start Slide:** Begin the presentation with a title slide based on the source URL
    or main topic. Example:
    ```markdown
    class: impact

    # Presentation based on {input_filename}
    ## Key Concepts

    .center[![Hugging Face Logo](https://huggingface.co/front/assets/huggingface_logo.svg)]

    ???
    Welcome everyone. This presentation, automatically generated from the content at
    {input_filename}, will walk you through the key topics discussed. Let's begin.
    ```
10. **Output:** Provide ONLY the complete Remark.js Markdown content, starting with
    the title slide and ending with the last content slide. Do not include any
    introductory text, explanations, or a final 'Thank You' slide.
11. **Conciseness:** Keep slide *content* (the part before `???`) concise (bullet
    points, short phrases). Elaborate in the *speaker notes* (the part after `???`).

**Generate the Remark.js presentation now:**
"""

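from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",  # replace with your own Hugging Face token
)

# Fill in the template. The three values below are placeholders for illustration;
# in the app they come from the scraped page.
prompt = PRESENTATION_PROMPT_TEMPLATE.format(
    markdown_content="# Example page\n\nSome text from the page.",
    image_list_str="- https://example.com/image.png",
    input_filename="https://example.com/page",
)

completion = client.chat.completions.create(
    model=LLM_MODEL,
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ],
)

# The generated Remark.js Markdown is in the message content
print(completion.choices[0].message.content)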
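
The prompt asks the model to separate slides with `---` and to put the spoken notes after `???`, so the response can be split back into slide content and transcript with plain string operations. The following is a minimal sketch under that assumption; `split_presentation` and the example text are hypothetical, and real model output may need more forgiving parsing (for example, tolerating extra whitespace around the separators).

```python
def split_presentation(markdown):
    """Split Remark.js-style Markdown into (slide_content, speaker_notes) pairs."""
    slides = []
    for raw_slide in markdown.split("\n\n---\n\n"):
        # Everything before the first ??? is shown; everything after is spoken
        content, _, notes = raw_slide.partition("???")
        slides.append((content.strip(), notes.strip()))
    return slides


# Hand-written stand-in for a model response
example_output = (
    "class: impact\n\n# Demo\n\n???\nWelcome everyone."
    "\n\n---\n\n"
    "# Second slide\n\n- One point\n\n???\nHere is the detail."
)

slides = split_presentation(example_output)
for number, (content, notes) in enumerate(slides, start=1):
    print(f"Slide {number}: {len(content)} characters of content, notes: {notes!r}")
```

The notes from each pair are what would be sent to text-to-speech, while the content part is what gets rendered as the slide image.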

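Text to speech

What I like most about Inference Providers is that you can grow a project into something mature with them. Once you have compared models and providers on the Hub, you can settle on a specific combination and embed it in your application.

That is what we did for the text-to-speech component. After experimenting with `fal-ai/minimax-tts`, we ended up using the `Dia-1.6B` model. You can try these models out through Inference Providers.

Once the LLM has created the transcript, you can pass it to a text-to-speech model like this: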

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fal-ai",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",
)

# audio is returned as bytes
audio = client.text_to_speech(
    "The answer to the universe is 42",
    model="nari-labs/Dia-1.6B",
)
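
Since the app produces one speech clip per slide, the call above would typically run in a loop over the speaker notes. The sketch below shows one way to do that; it is not the Space's code. The `synthesize_clips` helper is hypothetical, the `wav` extension is an assumption about the returned audio format, and a stand-in synthesizer is used so the example runs without network access.

```python
import tempfile
from pathlib import Path


def synthesize_clips(notes_per_slide, synthesize, output_dir, extension="wav"):
    """Call synthesize(text) -> bytes for each slide and save one file per slide."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, notes in enumerate(notes_per_slide, start=1):
        audio_bytes = synthesize(notes)
        # Assumption: the file extension matches the format the model returns
        path = output_dir / f"slide_{index:02d}.{extension}"
        path.write_bytes(audio_bytes)
        paths.append(path)
    return paths


# Stand-in so the sketch runs offline. With the client above, this could be:
#   lambda text: client.text_to_speech(text, model="nari-labs/Dia-1.6B")
def fake_synthesize(text):
    return text.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    clips = synthesize_clips(["Welcome everyone.", "Let's begin."], fake_synthesize, tmp)
    clip_names = [clip.name for clip in clips]
    print(clip_names)
```

Passing the synthesizer in as a function keeps the loop independent of the provider, which makes it easy to swap models while comparing them.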

Fal ai also provides some convenient clients, which can be used like this:

import fal_client

def on_queue_update(update):
    if isinstance(update, fal_client.InProgress):
        for log in update.logs:
            print(log["message"])


result = fal_client.subscribe(
    "fal-ai/minimax-tts/text-to-speech/turbo",
    arguments={
        "text": "Hello, world!",
        "voice_setting": {"speed": 1.0, "emotion": "happy"},
        "language_boost": "English",
    },
    with_logs=True,
    on_queue_update=on_queue_update,
)