Argunauts 训练阶段 I:在合成数据上进行持续预训练

社区文章 发布于 2025 年 2 月 18 日

简介

本文档记录了我们如何通过对 Llama-3.1-8B-Instruct 进行持续预训练来创建 DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT

这样做的目标是让 Llama-3.1-8B-Instruct 熟悉 Argdown 语法和语义,同时不削弱模型的其他能力。

训练 Llama-3.1-Argunaut-1-8B-SFT 是 Argunauts 项目的一部分。

SFT 数据集混合

训练数据包含 1M 个示例,由以下数据集组成:

数据集 权重(示例) 权重(token)
DebateLabKIT/deepa2-conversations 25% 49%
DebateLabKIT/deep-argmap-conversations 25% 18%
allenai/tulu-3-sft-mixture 50% 33%

成分 #1:deepa2-conversations

deepa2-conversations 数据集包含多轮对话,具有用户/助手角色(长度:2x2–20x2 轮,呈钟形分布)。用户逐步指示助手重构并分析最初给定的论证文本。

它是一个完全合成的数据集,回收利用了 deepa2 数据

为了构建 deepa2-conversations,我们定义了 30 种不同类型的工件,这些工件可以在分析过程中被引用或产生,其中一些用于注释和分析源文本,以标准形式重构论证,或将前提和结论形式化,而另一些则使用重构结果进行额外任务(例如编写 Z3 程序来检查演绎有效性,或使用函数调用结果增强论证)。

用户可能引用/助手可能在论证分析过程中产生的特征有:

源文本工件

N = "narrative"  # free form text that presents the argument as narrative, dialogque, letter, or email  
ST = "source_text"  # original and more schematic "source text" containing the (incomplete) argument and potential distractors
SP = "source_paraphrase"  # paraphrase of source text which leaves out extraneous material and streamlines the presentation

非正式分析工件

T = "title"  # short title of the argument 
G = "gist"  # key point summarizing the argument
TG = "title_gist"  # mixes title and gist
TGC = "title_gist_conclusion"  # mixes title, gist and conclusion in argdown snippet
X = "context"  # multi-purpose characterization of context

论证文本分析工件

R = "reasons"  # list of quotes from `ST` that justify claims
J = "conjectures"  # list of quotes from `ST` that are argued for
AT = "annotated_source_text"  # argumentative text annotation of source text
ATR = "annotated_source_text_ref"  # like AT, with reference to argdown reconstruction
IG = "inference_graph"  # renders all reasons and conjectures as mini argument map

论证重构工件

A = "argdown reconstruction"  # premise-conclusion structure
E = "erroneous_argdown"  # flawed reconstruction, to-be revised at later step
P = "premises"  # list of premises
I = "intermediary_conclusions"  # list of intermediary conclusions
C = "conclusion"  # final conclusion of argument
AM = "argument_map"  # map with subarguments, premises and conclusions 
ATG = "argdown_with_title_gist"  # argdown argument with header

逻辑分析工件

FP = "premises_formalized"  # formalizations of premises
FI = "intermediary_conclusions_formalized"  # formalizations of intermediary conclusions
FC = "conclusion_formalized"  # formalizations of final conclusion
PP = "predicate_placeholders"  # list of predicate placeholders
PE = "entity_placeholders"  # list of entity names
PM = "misc_placeholders"  # further placeholders (e.g., propositional constants)
K = "plchd_substitutions"  # keys, mapping placeholders to natural-language terms
AF = "argdown_with_formalizations"  # argdown argument with inline formalization info

杂项工件

SMTL = "SMT_program_local"  # smtlib code to check validity of each subargument with Z3
SMTG = "SMT_program_global"  # smtlib code to check global validity with Z3
AY = "argdown_augmented_yaml"  # argdown argument augmented with fake inline yaml from tool calls
QA = "evaluation_question_and_answer"  # argument analysis Q&A

并非所有工件都出现在所有对话中。

deepa2-conversations 中的每个对话都展开了一个有意义的零碎重构过程,用户在其中引导助手进行分析,指示其逐一生成工件。

但什么是有意义的重构过程,我们如何模拟它?

有意义的重构过程有两种基本约束:

  1. 冗余。如果一个工件已经被先前生成的工件所包含,那么生成它就是冗余的。例如:如果助手已经生成了完整的(正确的)前提-结论结构,那么要求助手只生成(正确的)结论就没有意义,因为它已经包含在我们已有的论证中。
  2. 依赖。某些工件依赖于其他工件。例如,通过注释识别源文本中的所有原因并将其链接到 Argdown 论证中对应的论证前提,需要在此之前论证已被重构。

此外,论证分析教学经验表明,在给定的重构状态下,某些步骤比其他步骤更合理。例如,在重构和形式化整个论证之前先处理文本会更有前景。

我们通过以下方式建模这些想法:

  • 在所有工件类型节点集上构建一个加权有向图,加上一个起始节点和一个终止节点(对话图),以及
  • 一组额外的合法路径约束。

下图描绘了整个对话图的子图

image/png

一个抽象的 deepa2 对话是从起始节点到终止节点的有向路径,它满足合法路径的额外约束。我们现在可以随机采样抽象的合法对话路径。展开以查看两个示例:

在此对话中,用户呈现源文本 (ST),然后要求助手...
  1. 总结论证(标题和要点),
  2. 非正式地映射源文本中包含的所有理由和推测性陈述,
  3. 将论证重构为前提-结论结构,
  4. 起草 SMTLIB 代码片段,用于检查论证的前提是否演绎地推导出其结论。

image/png

在此对话中,用户呈现源文本并指示助手...

  1. 将整个论证重构为前提-结论结构,
  2. (根据重构)转述源文本,
  3. 总结论证的标题和关键点,
  4. 通过标记原因和推测来注释源文本。

image/png

一切就绪后,deepa2-conversations 数据集的构建就变得简单了。对于每个分割:

  1. 我们设置分割大小。
  2. 我们遍历回收利用的 deepa2 数据,直到生成所需数量的示例
    • 随机采样抽象对话路径
    • 使用模板渲染当前 deepa2 记录中的论证工件
    • 在对话中用论证工件替换抽象工件类型

成分 #2:deep-argmap-conversations

Argunauts-SFT 混合的第二个成分基于最近发布的 syncIALO 数据集——一个包含 1k+ 论证图谱的语料库,每个图谱包含数百个论断。

deep-argmap-conversations 中的对话通常有 2 或 4 轮,涉及用户和助手。用户指示助手创建 Argdown 论证图。虽然输入和具体任务可能不同,但构建此数据集的一般思路如下:

  1. 从随机选择的 syncIALO 辩论中抽取一个子图(不超过 20 个论证)。
  2. 子图渲染为整洁且一致的 Argdown 论证图(full_argdown_map)。这将作为ANSWER
  3. 子图转换为输入,其中包含的信息少于子图
  4. 输入插入合适的指令模板中,该模板给出INSTRUCTION
  5. INSTRUCTION - ANSWER 对被渲染为对话。

展开以查看所有子任务和示例模板的列表(为说明目的而样式化和缩短)。

使用论证主张的乱序列表作为输入

- role: user
  content: >
    Organize the following sentences into an informal argument map (argdown syntax)
    according to their pairwise dialectic relations.
    sentences:
    {{ sentence_list }}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

要求助手向部分/不完整的论证图添加句子

- role: user
  content: >
    Organize the following sentences into an informal argument map according to
    their pairwise dialectic relations. Some of these dialectic relations are
    shown below (partialmap) and should be taken into account.
    sentences:
    {{ sentence_list }}
    partialmap:
    ```argdown
    {{ partialmap }}
    ```
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```\n
- role: user
  content: >
    Complete and revise the following argument map by adding the missing sentences.
    incomplete argument map:
    ```argdown\n{{ partialmap }}\n```
    missing sentences:
    {{} sentence_list }}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

将两个或多个图合并为一个

- role: user
  content: >
    Merge the following argument maps into a single map. Streamline and add missing
    arguments as required.{% if hints | title %} In doing so, pay attention to the
    given hints.
    {% for partialmap in partialmaps %}
    argument map {{loop.index}}:
    ```argdown
    {{ partialmap }}
    ```
    {% endfor %}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

brokenmap是通过随机扭曲正确地图获得的

- role: user
  content: >
    Improve the following argument map, if possible. In doing so, pay attention to the given hints.
    argument map:
    ```argdown\n{{ brokenmap }}\n```
    {% if hints %}
    hints:\n {{ hints }}
    {% endif %}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

以一个自由源文本作为起点

- role: user
  content: >
    Reconstruct the following pros and cons list as an argument map (using argdown syntax).
    pros and cons:
    ```argdown\n{{ proscons }}\n```
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

使用自由源文本作为输入

- role: user
  content: >
    Lay out the arguments presented in the following material as an informal argument map (argdown syntax).
    <material>
    {{ source_text }}
    </material>
    <context>
    {{ context }}
    </context>
    Just provide the argument map in your answer.
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

要求助手向论证图添加单个论断(前提,结论)

- role: user
  content: >
    Your task is to expand the following argument map by adding the
    premises / conclusion(s) of the arguments mentioned below as additional
    claims.
    argument map:
    ```argdown\n{{ partialmap }}\n```
    Insert the premises / conclusion(s) of 
    the following arguments:
    {{ unfolded_nodes_labels }}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

最后,一项任务是将完整的 Argdown 图作为输入

- role: user
  content: >
    Your task is to reconstruct a single argument from the following argument map and render
    it in standard form (premise-conclusion structure).
    argument map:\n\n
    ```argdown\n{{ full_argdown_map }}\n```\n\n
    argument to reconstruct:\n\n
    ```argdown\n{{ argument_label }}\n```\n
- role: assistant
  content: >
    ```argdown\n{{ argument_argdown_snippet }}\n```

通过将这些不同的管道应用于 syncIALO,我们创建了一个多样化的对话数据集,其中包含要求苛刻的论证映射任务。

成分 #3:Tülu 3 SFT 混合

我们慷慨地将 AI2 的 Tülu 3 SFT 混合添加到我们的训练数据中,以防止遗忘宝贵的技能(又称“灾难性遗忘”)。

训练

从概念上讲,我们将此 SFT 训练阶段视为持续预训练,而非微调。主要目的是使模型熟悉 Argdown 语法和语义。这也是我们使用 打包和梯度累积来增加有效批处理大小的原因。由于 deepa2-conversations 中的某些对话最多有 20x2 轮,我们选择使用 8192 上下文窗口进行训练。

🤔 为什么我们不只训练补全?

每个用户查询都是给定分析状态下合理的指令,模型应该学习。(但我认为测试掩蔽效果是值得的。)

🤔 聊天模板和持续预训练——这不是很奇怪吗?

也许这不常见,但我认为聊天模板应该在预训练期间广泛使用(也许在“教科书就是你所需要的一切”项目中也是如此)。在未来的迭代中,我将使用各种不同的聊天模板和格式来增加持续预训练期间的多样性。

技术上,训练栈是用以下工具构建的:

基础模型:meta-llama/Llama-3.1-8B-Instruct

训练参数和设置

参数
学习率 5e-6
学习率调度器 余弦
上下文窗口大小 8192
有效批量大小 32
deepspeed ZeRO 第三阶段
deepspeed CPU 卸载
频谱最高%信噪比目标 30%

在 2 个 H100 上训练了 35 小时。

评估

聊天体验

让我分享一下我与 ggbetz/Llama-3.1-Argunaut-1-8B-SFT-Q4-mlx 进行的最初两次聊天。

👤 **您**
您好,您熟悉 Argdown 吗?

🤖 **助手**
是的,我熟悉 Argdown。它是一种用于推理的标记语言,允许您以自然易读的方式编写结构化文本。使用 Argdown,您可以创建论证图,其中句子通过代表不同关系(例如,支持、驳斥、理由)的线条连接。这使得可视化复杂的论证结构变得容易。

您想创建 Argdown 论证图吗?请随意提供主题和/或一些句子,我会帮助您设置图!

👤 **您**
请您将以下文本映射出来好吗?

校服可以阻止犯罪并提高学生安全。在加利福尼亚州长滩,实施全区 K-8 强制校服政策两年后,该区学校的袭击和殴打报告下降了 34%,使用致命武器的袭击下降了 50%,打架事件下降了 51%,性犯罪减少了 74%,抢劫下降了 65%,持有武器(或“类似武器”)下降了 52%,持有毒品下降了 69%,故意破坏行为下降了 18%。[64]

内华达州斯帕克斯中学实施校服政策一年后,学校警方数据显示,警方日志报告下降了 63%,帮派活动、学生打架、涂鸦、财产损失和殴打也都有所减少。一项同行评审研究发现,有校服政策的学校的枪支相关事件比没有校服的学校少 12%,毒品相关事件少 15%。[25][69]

此外,校服还可以防止学生将武器藏在宽松的衣服下,便于在实地考察时追踪学生,并使校园内的入侵者更显眼。新泽西州洛迪学区的督学弗兰克·夸特罗内表示:“当学生穿着统一时,您会让他们更安全。如果有人进入建筑物,入侵者很容易被识别出来。”[6][38]

此外,校服为学生创造了一个公平的竞争环境,减少了同伴压力和欺凌。当所有学生都穿着统一时,学生之间因服装选择而产生的竞争以及对那些穿着廉价或不时髦服装的学生进行嘲笑的情况可以消除。Schoolwear Association 的研究发现,83% 的教师认为“一套好的校服……可以防止基于外貌或经济背景的欺凌”。北德克萨斯大学家长教育中心创始人兼主任 Arminta Jacobson 表示,校服“让所有孩子在外观上处于同一起跑线上。我认为这可能会给他们一种归属感和被社会接受的感觉。”[5][91]

🤖 **助手**

School uniforms deter crime and increase student safety.
    <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
    <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
    <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
    <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

👤 **您**
请用您自己的话总结文本中提出的主要论点。请勿超过 2-3 句话。

🤖 **助手**
校服是解决学校犯罪和安全问题的有效方案,这已由实施校服政策的各机构事件减少的情况所证明。

👤 **您**
让我们专注于这个论点。为了记录,结论是什么?

🤖 **助手**
文本认为
校服可以阻止犯罪并提高学生安全。

通过检查此论证,我们现在可以着手重构它。请提供此推理指南的下一步

👤 **您**
您能用 Argdown 语法将此论证重构为前提-结论结构吗?

🤖 **助手**
该论证可重构如下:

School uniforms deter crime and increase student safety.
    <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
    <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
    <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
    <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🛑 答案很差

那不是标准形式的重构,只是重复了前面给出的图。

👤 **您**
不,我真的希望你以标准形式重构论证,列出前提和结论。

🤖 **助手**

School uniforms deter crime and increase student safety.
    <+ <Subargument 1>
        <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
        <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
        <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
        <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🛑 答案很差

尽管用户指示,助手仍未能纠正答案。

我在第二次聊天中探究了分析前提-结论结构的能力:展开阅读。

👤 **您**
您好,您熟悉 Argdown 并将论证呈现为前提-结论结构吗?

🤖 **助手**
是的,我熟悉 Argdown,并且可以将论证呈现为前提-结论结构。您想讨论什么或者想让我演示什么?

👤 **您**
您能描述一下以下文本中提出的主要论点吗?

校服可以阻止犯罪并提高学生安全。在加利福尼亚州长滩,实施全区 K-8 强制校服政策两年后,该区学校的袭击和殴打报告下降了 34%,使用致命武器的袭击下降了 50%,打架事件下降了 51%,性犯罪减少了 74%,抢劫下降了 65%,持有武器(或“类似武器”)下降了 52%,持有毒品下降了 69%,故意破坏行为下降了 18%。[64]

内华达州斯帕克斯中学实施校服政策一年后,学校警方数据显示,警方日志报告下降了 63%,帮派活动、学生打架、涂鸦、财产损失和殴打也都有所减少。一项同行评审研究发现,有校服政策的学校的枪支相关事件比没有校服的学校少 12%,毒品相关事件少 15%。[25][69]

此外,校服还可以防止学生将武器藏在宽松的衣服下,便于在实地考察时追踪学生,并使校园内的入侵者更显眼。新泽西州洛迪学区的督学弗兰克·夸特罗内表示:“当学生穿着统一时,您会让他们更安全。如果有人进入建筑物,入侵者很容易被识别出来。”[6][38]

此外,校服为学生创造了一个公平的竞争环境,减少了同伴压力和欺凌。当所有学生都穿着统一时,学生之间因服装选择而产生的竞争以及对那些穿着廉价或不时髦服装的学生进行嘲笑的情况可以消除。Schoolwear Association 的研究发现,83% 的教师认为“一套好的校服……可以防止基于外貌或经济背景的欺凌”。北德克萨斯大学家长教育中心创始人兼主任 Arminta Jacobson 表示,校服“让所有孩子在外观上处于同一起跑线上。我认为这可能会给他们一种归属感和被社会接受的感觉。”[5][91]

🤖 **助手**
校服不仅是机构营造统一氛围的传统而受人尊敬的方式,而且还有助于学生的安全和表现。实施校服可以显著降低犯罪率,改善学生行为,并增强校园安全。

👤 **您**
您能将此论证重构为标准形式(前提-结论结构)吗?

🤖 **助手**

(1) Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
(2) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
(3) The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
(4) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
-----
(5) Implementing school uniforms is an effective way to deter crime and increase student safety.

🛑 答案很差

机械地模仿训练数据(deep-argmap)中的论证重构:无意义地复制一个命题以获得 4 个前提。

👤 **您**
您能否添加中间结论并添加推理规则?

🤖 **助手**
当然,以下是推导出最终结论的中间步骤:

Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
The reduction in crime rates is a result of the uniform policy and not just a coincidence.
The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
Implementing school uniforms is an effective way to deter crime and increase student safety.

👤 **您**
请以 Argdown 格式呈现精炼后的重构。

🤖 **助手**
该论证可表示如下:

[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
    <+ <Subargument 1>
        <+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
        <+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
        <+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.

👤 **您**
作为前提-结论结构,并添加推理信息呢?

🤖 **助手**

[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
    <+ <Subargument 1>
        <+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
        <+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
        <+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.

🛑 答案很差

模型未能遵循用户指示,未完善前提-结论分析。

从 Argdown 语法上看,这一切都很好,但我的主观印象是,模型在持续预训练期间一直遵循它所看到的模式,并且在被指示执行偏离合成训练语料库中内容的 Argdown 相关操作时,它会有些顽固。

指标

我们评估 DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT 在三个基准上的性能:

Argdown Bench 是一个新创建的测试套件。它包含来自 deep-argmap-conversations 测试拆分的论证映射任务。我们使用自己的 Argdown 解析器解析模型完成,并评估模型的答案是否

  1. 在语法上是有效的 Argdown 代码 (pass@x),
  2. 与黄金答案接近(由 networkx 的图距离度量),
  3. 与黄金答案完全同构(图距离为零)。

我们在自己的设置(vLLM 推理服务器)上运行所有评估,得到:

Llama-3.1-8B-Instruct Argunaut-1-8B-SFT
⬇️ Argdown 基准
pass@1 80.8 98.9
pass@5 98.9 99.8
图相似度 8.4 65.5
同构 1.3 44.0
🤗 HF 排行榜
MMLU pro 37.6 34.5
MUSR 40.1 41.6
GPQA 钻石 32.8 30.8
GPQA 主榜 28.5 32.1
MATH 12.5 9.1
BBH 54.7 48.2
⛓️ COT 排行榜
LogiQA 5.9 1.4
LogiQA2 15.5 0.8
LSAT-ar 11.7 3.0
LSAT-lr 20.8 3.9
LSAT-rc 27.5 13.8

Argdown 基准的结果有力证实 Argunaut-1-8B-SFT 确实学习了 Argdown 语法和语义。这与我们在上面聊天中观察到的合法 Argdown 代码片段一致。

HF 排行榜的结果表明,这些通用多项选择任务的性能略有下降,但肯定不是灾难性的。这里没有技能的灾难性损失。

🛑 然而,COT 排行榜的结果却显示出 CoT 效率的灾难性下降:Argunaut-1-8B-SFT 似乎失去了基础模型通过 CoT 推理提高答案准确性的能力。

后续步骤

  • 识别超越我们合成的 Argunauts SFT 混合的训练数据:多样化、高质量的逻辑分析演示,特别是示范先前重构的连续修订以及 Argdown 的语法灵活性;
  • 使用在线 DPO 反复训练 Argunaut-1-8B-SFT。

社区

注册登录 发表评论