syncIAL🍏: A Multi-Purpose Corpus of Synthetic Debates and Argument Maps

Community Article · Published February 4, 2025

What is syncIALO, anyway?

syncIALO is a collection of synthetic argument mapping datasets. Its first main corpus (prosaically named synthetic_corpus-001) contains:

  • more than 600,000 claims (i.e., individual arguments), organized in
  • more than 1,000 argument maps.

syncIALO argument maps are directed graphs: nodes represent claims, and labeled edges indicate that one claim supports or attacks another.

The argument maps can be easily loaded and processed with networkx.

from huggingface_hub import hf_hub_download
import json
import networkx as nx
from pathlib import Path

path = Path(hf_hub_download(
    repo_id="DebateLabKIT/syncialo-raw",
    filename="data/synthetic_corpus-001/eval/debate-eval-0001/node_link_data-debate-eval-0001.json"))
argmap = nx.node_link_graph(json.loads(path.read_text()))

type(argmap)
# >>> networkx.classes.digraph.DiGraph

argmap.number_of_nodes()
# >>> 511

argmap.number_of_edges()
# >>> 510

next(iter(argmap.nodes.data()))[1]
# >>> {'claim': 'Governments should provide substantial financial
# >>> incentives to families with children to counteract declining
# >>> population growth and mitigate the long-term consequences on
# >>> societal stability and progress.',
# >>>  'label': 'Pay to Populate'}

Let me show you a subgraph of a syncIALO debate, randomly sampled from the training split and rendered in Argdown format:

[Learning Over Leisure]: Schools should restrict students' access to fan fiction and social media to protect the integrity of education. 
    <- <Restriction Infringes on Freedom of Expression>: Restricting access to fan fiction and social media unconstitutionally limits students' right to freedom of expression and stifles their creativity.
        <+ <Lifelong Learning>: By exercising their freedom of expression, students develop essential skills in critical thinking, problem-solving, and effective communication, preparing them for success in their future careers and personal lives.
        <- <Echo Chamber Effect>: Exercising freedom of expression in an unstructured environment can create an echo chamber where students only communicate with like-minded individuals, failing to develop the skills to engage with diverse perspectives and opposing views.
            <- <Silent Observer>: Developing skills to engage with diverse perspectives and opposing views is not essential for effective communication in situations where listening and observing, rather than actively engaging, is the most effective strategy.
        <- <Fan Fiction Distortion>: Fan fiction and social media often distort students' creativity by promoting unoriginal and copyrighted content, rather than fostering genuine artistic expression.
            <- <Artistic Evolution>: The value of artistic expression lies in its ability to evoke emotions and spark new ideas, regardless of whether it is original or builds upon existing works, making the distinction between original and unoriginal content irrelevant.
        <+ <Innovation Incubator>: Unrestricted freedom of expression enables students to develop critical thinking, problem-solving, and communication skills, essential for academic and professional success.
    <+ <Focus on Fundamentals-1>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
    <+ <Focus on Fundamentals-2>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.
        <+ <Knowledge Pyramid>: A strong grasp of foundational subjects allows students to recognize relationships between different ideas and concepts, creating a hierarchical structure of knowledge that enhances retention and recall of critical information.

What can I do with it?

The raw syncIALO data is well suited for "distilling" more specific datasets.

  • You can use syncIALO to build datasets for pre-training, SFT, DPO, or RLVR.
  • You can create challenging benchmarks that probe the reasoning abilities of LLMs.
  • You can create tailored few-shot examples for generating argument maps with LLMs.
  • You can use syncIALO data as seeds for multi-agent deliberation and for LLM personas.

For all of these, you will need to transform the syncIALO debates and distill more specific tasks.

To begin with, you could sample subgraphs and simply verbalize them as conversations, which then serve as training texts… But we can do better by exploiting the rich information contained in syncIALO.
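
As a minimal sketch of that naive approach (not part of the syncIALO tooling), one might sample a neighbourhood around a random claim and verbalize its relations as text. `argmap` is the DiGraph loaded above; the edge attribute name ("valence" with values like "pro") is an assumption made purely for illustration.

import random
import networkx as nx

def sample_subgraph(argmap: nx.DiGraph, radius: int = 2) -> nx.DiGraph:
    """Keep everything within `radius` hops of a random node, ignoring edge direction."""
    center = random.choice(list(argmap.nodes))
    return nx.ego_graph(argmap, center, radius=radius, undirected=True)

def verbalize(subgraph: nx.DiGraph) -> str:
    """Render each support/attack edge as a simple sentence."""
    lines = []
    for source, target, data in subgraph.edges(data=True):
        # NOTE: the "valence" key and its values are assumptions for this sketch
        relation = "supports" if data.get("valence") == "pro" else "attacks"
        lines.append(f"'{subgraph.nodes[source]['label']}' {relation} "
                     f"'{subgraph.nodes[target]['label']}'.")
    return "\n".join(lines)

print(verbalize(sample_subgraph(argmap)))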

The following recipe for creating reasoning tasks describes a more interesting distillation process:

  1. Sample a subgraph (this will be the answer).
  2. Process or distort the subgraph (this yields the input parameters).
  3. Ask the model to create an argument map from the input parameters (this is the prompt).

For example:

| Prompt | Answer |
|---|---|
| Here is a list of statements … Reconstruct them as an argument map! | argument map |
| Consider these three maps … Merge them into one argument map! | argument map |
| Here is a flawed reconstruction … Revise and improve it! | argument map |
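
Here is a hedged sketch of steps 1–3 for the first row of the table: strip the structure from a sampled subgraph and shuffle its claims to obtain the prompt, while the untouched subgraph serves as the answer. Node attribute names follow the loading example above.

import random

def make_reconstruction_prompt(subgraph) -> str:
    claims = [data["claim"] for _, data in subgraph.nodes(data=True)]
    random.shuffle(claims)  # distort: remove any ordering hints
    statements = "\n".join(f"- {claim}" for claim in claims)
    return ("Here is a list of statements:\n"
            f"{statements}\n"
            "Reconstruct them as an argument map!")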

The deep-argmap-conversations dataset has been distilled accordingly and further illustrates the argument mapping tasks that can be created from raw syncIALO data.

In the same way, you can distill DPO data:

| Prompt | Chosen | Rejected |
|---|---|---|
| Here is a list of statements … Reconstruct them as an argument map! | argument map | shuffled argument map |
| … | … | … |
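
An illustrative sketch of such a DPO pair: the chosen answer is the original subgraph, the rejected answer is the same map with its claims permuted across nodes, so that structure and content no longer fit. `to_argdown` is a hypothetical serializer; `make_reconstruction_prompt` is the helper sketched above.

import random

def shuffle_claims(subgraph):
    corrupted = subgraph.copy()
    claims = [data["claim"] for _, data in corrupted.nodes(data=True)]
    random.shuffle(claims)
    for node, claim in zip(corrupted.nodes, claims):
        corrupted.nodes[node]["claim"] = claim
    return corrupted

def make_dpo_example(subgraph, to_argdown):
    return {
        "prompt": make_reconstruction_prompt(subgraph),
        "chosen": to_argdown(subgraph),
        "rejected": to_argdown(shuffle_claims(subgraph)),
    }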

If you instruct the LLM to produce argument maps in a parsable format (such as yaml, mermaid, or Argdown), there are endless possibilities for verifying the solutions and creating RLVR data:

| Prompt | Reward |
|---|---|
| Here is a list of claims … Reconstruct them as a yaml argument map! | valid yaml? |
| Here is a list of claims … Reconstruct them as an argument map with k nodes! | valid yaml with k nodes? |
| … | … |

(Of course, syncIALO is not strictly necessary for this kind of RLVR training, but such training may well benefit from diverse and well-designed syncIALO prompts.)
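
A sketch of a verifiable reward in the spirit of the table above: parse the answer as YAML and check the node count. The expected layout (a top-level 'nodes' list) is an assumption for illustration, not the syncIALO format.

import yaml

def reward(answer: str, k: int) -> float:
    try:
        parsed = yaml.safe_load(answer)
    except yaml.YAMLError:
        return 0.0  # not valid YAML
    if not isinstance(parsed, dict) or not isinstance(parsed.get("nodes"), list):
        return 0.0  # valid YAML, but not an argument map in the expected layout
    if len(parsed["nodes"]) != k:
        return 0.5  # well-formed map, wrong size
    return 1.0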

Moreover, multiple-choice tasks can easily be created, such as:

| Prompt | Options |
|---|---|
| Consider this argument map … What is x (some graph property)? | x=a, x=b, … |
| Here is a list of statements … Which map adequately captures the argumentation? | a) argument map, b) shuffled map, … |
| … | … |
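
For instance, a multiple-choice item about a graph property might be built as sketched below (here: the depth of the map, measured as the longest distance from the root claim). The distractors are simple perturbations of the true value; `root` is assumed to be the node id of the central claim.

import random
import networkx as nx

def depth_question(subgraph: nx.DiGraph, root):
    depth = max(nx.shortest_path_length(subgraph.to_undirected(as_view=True), root).values())
    distractors = {depth + 1, depth + 2, abs(depth - 1)} - {depth}
    options = [depth] + sorted(distractors)
    random.shuffle(options)
    return {
        "prompt": "Consider this argument map ... What is its depth?",
        "options": options,
        "answer": depth,
    }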

This is useful for improving CoT/reasoning quality via RL with verifiable rewards, and of course for benchmarking LLMs.

But syncIALO can also help at inference time. Suppose you want a model to reconstruct a given text as an argument map of a specific size, say in Argdown. If the model struggles, you can create arbitrarily many few-shot examples tailored to the problem at hand and use them to guide the model.
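
One way to do this, sketched under the same assumptions as above: sample subgraphs with exactly the requested number of nodes from a training debate and pair the verbalized statements with the target map. `sample_subgraph`, `verbalize`, and `to_argdown` are the illustrative helpers from the earlier sketches.

def few_shot_examples(argmap, n_nodes: int, k: int, to_argdown, max_tries: int = 10_000):
    examples = []
    for _ in range(max_tries):
        if len(examples) == k:
            break
        candidate = sample_subgraph(argmap)
        if candidate.number_of_nodes() == n_nodes:
            examples.append({"input": verbalize(candidate), "output": to_argdown(candidate)})
    return examples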

Persona datasets help to increase the diversity of synthetic datasets, to explore solution spaces broadly at inference time, and to calibrate agentic AI systems. syncIALO can play a similar role and complement existing persona datasets: for example, a persona can be further characterized by the stances they take in a debate, or by the arguments they put forward, endorse, or criticize.

So, syncIALO really is multi-purpose. Let's explore together what you can do with it!

How did you build it?

We set up a dynamic pipeline that simulates a comprehensive argument mapping process: an LLM-based agent simulates a critical thinker who searches for, evaluates, and stores novel arguments.

The argument maps are built recursively by adding pro and con arguments to leaf nodes until a maximum depth is reached. The AI agent identifies the premises of a target argument A and then drafts further arguments that support or attack A. It selects candidate arguments for salience and diversity, and checks for duplicates (via semantic similarity) before adding an argument to the map.

To increase diversity, we sample topics and motions by randomly picking tags from a diverse tag cloud. We also have the AI critical thinker adopt a randomly drawn persona when generating new candidate arguments.
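
The following is a highly simplified sketch of that recursive loop, not the actual syncialo package. `generate_candidates(claim)` stands in for the LLM calls that draft pro and con arguments; edge direction, the "valence" attribute, and the similarity threshold are illustrative assumptions.

import networkx as nx
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def is_duplicate(claim: str, existing: list[str], threshold: float = 0.9) -> bool:
    if not existing:
        return False
    similarities = util.cos_sim(embedder.encode([claim]), embedder.encode(existing))
    return bool(similarities.max() >= threshold)

def expand(argmap: nx.DiGraph, node: str, depth: int, max_depth: int, generate_candidates):
    if depth >= max_depth:
        return
    for claim, valence in generate_candidates(argmap.nodes[node]["claim"]):
        existing = [data["claim"] for _, data in argmap.nodes(data=True)]
        if is_duplicate(claim, existing):
            continue  # skip near-duplicates of claims already in the map
        child = f"{node}.{argmap.number_of_nodes()}"
        argmap.add_node(child, claim=claim)
        argmap.add_edge(child, node, valence=valence)  # the new argument targets its parent
        expand(argmap, child, depth + 1, max_depth, generate_candidates)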

Depending on the workflow step, the LLM-based agent is powered by different ❤️ open-source models. We use meta-llama/Llama-3.1-405B to generate and evaluate arguments, and fine-tuned Llama-3.1-8B models for less demanding generation tasks such as formatting. MoritzLaurer/deberta-v3-large-zeroshot-v2.0 serves as our all-purpose classifier, and we use sentence-transformers/all-MiniLM-L6-v2 to produce sentence embeddings.

The pipeline is built on top of ❤️ open-source frameworks.

We are releasing the syncIALO datasets together with the python package of the same name, which we used to build the synthetic datasets.

What's the broader context?

Philosophically, syncIALO is inspired by the Rylean view that cognitive skill is intimately tied to argumentative speech. To a large extent, a person's cognitive ability consists in being able to produce utterances in accordance with the norms of logic, evidence, or scientific reasoning. Argument mapping and critical thinking may help people excel in this domain. That is why they may also provide useful resources for training and probing AI systems.

The excellent kialo.com project deserves praise for having designed an intuitive and effective online platform for collaborative debating and argument mapping. It's great to see them being so successful.

The informal argument maps that have accumulated on the Kialo website are a gold mine for NLP researchers, AI engineers, computational sociologists, and critical thinking scholars. Yet this "mine" is legally off limits to them: debate data downloaded or scraped from the website must not be used for research or commercial purposes without explicit permission or a licensing agreement.

This is another motivation for creating the syncIALO corpora, which may serve as a substitute for Kialo data. (Clearly, though, syncIALO is no universal substitute: a cognitive scientist who empirically studies how humans actually argue, for instance, may find syncIALO of little help.)

Who is behind this project?

syncIALO was conceived and built by the DebateLab team at KIT. You can find us on Hugging Face and GitHub, or follow our blog.

🤗 Hugging Face has sponsored the syncIALO project with inference time / compute credits. 🙏 We gratefully acknowledge their generous support. 🫶

How can I get involved?

You can help to improve syncIALO and overcome its current limitations, e.g., by contributing pipelines to

  • check the data (argumentative relations, wording, appropriate labels)
  • measure local and global diversity (claim embeddings; see the sketch after this list)
  • detect and remove duplicate claims
  • build improved versions via argument refinement and re-linking
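
As a rough sketch of one such contribution, the "local" diversity of a single debate could be measured as one minus the mean pairwise cosine similarity of its claim embeddings. The embedding model mirrors the one used in the construction pipeline; the metric itself is just one plausible choice.

from sentence_transformers import SentenceTransformer, util

def claim_diversity(argmap) -> float:
    claims = [data["claim"] for _, data in argmap.nodes(data=True)]
    n = len(claims)
    if n < 2:
        return 0.0
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode(claims)
    similarities = util.cos_sim(embeddings, embeddings)
    mean_off_diagonal = (similarities.sum() - similarities.trace()) / (n * (n - 1))
    return 1.0 - float(mean_off_diagonal)  # higher values = more diverse claims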

You might also consider

  • creating new corpora (changing the LLMs, topic tags, graph configurations)
  • translating the existing debate corpora (we already have a pipeline for that)

Most importantly, however, we invite you to

  • build with syncIALO and share your work

Don't hesitate to reach out to us!

Community

Following your example:
Your FoF-2 = FoF-1, which skews the dataset by over-weighting / saturating the same argument as two different arguments.

https://argdown.org/syntax/#equivalence-classes

They should look like this:

<Focus on Fundamentals>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
<Focus on Fundamentals>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.

resulting in this:

[Learning Over Leisure]: Schools should restrict students' access to fan fiction and social media to protect the integrity of education. 
    <- <Restriction Infringes on Freedom of Expression>: Restricting access to fan fiction and social media unconstitutionally limits students' right to freedom of expression and stifles their creativity.
        <+ <Lifelong Learning>: By exercising their freedom of expression, students develop essential skills in critical thinking, problem-solving, and effective communication, preparing them for success in their future careers and personal lives.
        <- <Echo Chamber Effect>: Exercising freedom of expression in an unstructured environment can create an echo chamber where students only communicate with like-minded individuals, failing to develop the skills to engage with diverse perspectives and opposing views.
            <- <Silent Observer>: Developing skills to engage with diverse perspectives and opposing views is not essential for effective communication in situations where listening and observing, rather than actively engaging, is the most effective strategy.
        <- <Fan Fiction Distortion>: Fan fiction and social media often distort students' creativity by promoting unoriginal and copyrighted content, rather than fostering genuine artistic expression.
            <- <Artistic Evolution>: The value of artistic expression lies in its ability to evoke emotions and spark new ideas, regardless of whether it is original or builds upon existing works, making the distinction between original and unoriginal content irrelevant.
        <+ <Innovation Incubator>: Unrestricted freedom of expression enables students to develop critical thinking, problem-solving, and communication skills, essential for academic and professional success.
    <+ <Focus on Fundamentals>: Restricting access to fan fiction and social media in schools allows students to prioritize core academic subjects and develop a solid foundation in STEM fields, literature, and critical thinking.
    <+ <Focus on Fundamentals>: By limiting access to non-academic online content, schools can redirect students' attention to foundational subjects, fostering a stronger understanding of complex concepts and better retention of critical information.
        <+ <Knowledge Pyramid>: A strong grasp of foundational subjects allows students to recognize relationships between different ideas and concepts, creating a hierarchical structure of knowledge that enhances retention and recall of critical information.

Problem solved; now we need to fix the dataset.

Run all the JSONs through:

#!/usr/bin/env python3
"""
Script to fix “almost duplicated” labels in a debate JSON.
It reads an input JSON file (with a “nodes” array where each node has a “label”),
finds labels that are very similar (according to a fuzzy–match threshold),
and then updates all such nodes to share a canonical label.
"""

import json
import sys
import logging
import argparse
from difflib import SequenceMatcher
from typing import List, Dict, Any

# Set up logging configuration
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between two strings (0 to 1)."""
    return SequenceMatcher(None, a, b).ratio()

def cluster_labels(labels: List[str], threshold: float = 0.90) -> Dict[str, str]:
    """
    Given a list of labels, return a dictionary mapping each label to a canonical label.
    Two labels that are at least 'threshold' similar will be treated as duplicates.
    (The first label encountered becomes the canonical version.)
    """
    canonical: Dict[str, str] = {}
    unique_labels = list(set(labels))  # unique labels in no particular order
    unique_labels.sort()  # sort for consistency

    # Build clusters by iterating over the unique labels.
    for i, label in enumerate(unique_labels):
        if label in canonical:
            continue
        canonical[label] = label  # label becomes its own canonical version
        for other_label in unique_labels[i + 1:]:
            if other_label in canonical:
                continue
            if similarity(label, other_label) >= threshold:
                canonical[other_label] = label
    return canonical

def fix_labels(data: Dict[str, Any], threshold: float = 0.90) -> Dict[str, Any]:
    """
    Given a debate JSON object (with a "nodes" key), fix labels by unifying similar ones.
    Returns the modified JSON object.
    """
    if "nodes" not in data:
        logging.error("No 'nodes' key found in JSON data.")
        return data

    nodes = data["nodes"]
    if not isinstance(nodes, list):
        logging.error("'nodes' should be a list.")
        return data

    # Extract all labels; if a node doesn't have a "label", default to an empty string.
    labels = [node.get("label", "") for node in nodes if isinstance(node, dict)]
    
    # Build mapping from each label to its canonical version.
    mapping = cluster_labels(labels, threshold=threshold)
    logging.info("Found %d unique labels; mapping to canonical labels:", len(mapping))
    for key, canonical_label in mapping.items():
        if key != canonical_label:
            logging.info("  %r --> %r", key, canonical_label)

    # Update each node's label using the mapping.
    for node in nodes:
        if isinstance(node, dict):
            original_label = node.get("label", "")
            if original_label in mapping:
                node["label"] = mapping[original_label]
    return data

def parse_args() -> argparse.Namespace:
    """Parse command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Fix almost duplicated labels in a debate JSON file."
    )
    parser.add_argument("input_file", help="Path to the input JSON file.")
    parser.add_argument("output_file", help="Path where the fixed JSON will be saved.")
    parser.add_argument(
        "--threshold", type=float, default=0.90,
        help="Fuzzy matching threshold (default: 0.90)."
    )
    return parser.parse_args()

def main() -> None:
    args = parse_args()

    # Load JSON data from file with error handling.
    try:
        with open(args.input_file, "r", encoding="utf-8") as infile:
            data = json.load(infile)
    except FileNotFoundError:
        logging.error("Input file '%s' not found.", args.input_file)
        sys.exit(1)
    except json.JSONDecodeError as e:
        logging.error("Error decoding JSON from '%s': %s", args.input_file, e)
        sys.exit(1)
    except Exception as e:
        logging.error("An unexpected error occurred while reading '%s': %s", args.input_file, e)
        sys.exit(1)

    # Fix labels in the data.
    fixed_data = fix_labels(data, threshold=args.threshold)

    # Write the fixed data to the output file with error handling.
    try:
        with open(args.output_file, "w", encoding="utf-8") as outfile:
            json.dump(fixed_data, outfile, indent=2, ensure_ascii=False)
    except Exception as e:
        logging.error("An error occurred while writing to '%s': %s", args.output_file, e)
        sys.exit(1)

    logging.info("Fixed JSON written to '%s'", args.output_file)

if __name__ == "__main__":
    main()

https://huggingface.co/datasets/DebateLabKIT/syncialo-raw/raw/main/data/synthetic_corpus-001/train/debate-train-0444/node_link_data-debate-train-0444.json

and we get this stdout:

λ python fix_labels.py input.json output.json
INFO: Found 638 unique labels; mapping to canonical labels:
INFO:   'Algorithmic Bias Amplification' --> 'Algorithmic Amplification'
INFO:   'Biased Benchmarks' --> 'Biased Benchmark'
INFO:   'Crime Deterrent' --> 'Crime Deterrence'
INFO:   'Dataset Augmentation' --> 'Data Augmentation'
INFO:   'Data Deserts' --> 'Data Desert'
INFO:   'Diverse Datasets' --> 'Diverse Data Sets'
INFO:   'Surveillance Slippery Slope' --> 'Mass Surveillance Slippery Slope'
INFO:   'National Security Exemption' --> 'National Security Exception'
INFO:   'Protecting the Vulnerable:' --> 'Protecting the Vulnerable'
INFO:   'Redundant Safeguards' --> 'Redundancy Safeguard'
INFO: Fixed JSON written to 'output.json'

All you need to do is adapt the main routine and run a single pass. Right now your dataset doesn't follow best practices.

Credits: me, the argdown docs, and AI for [code review] and [error handling].

Article author

Hi, thanks for taking such a close look. De-duplicating the dataset and merging semantically identical arguments is certainly an important direction for improving syncIALO. Thanks also for the concrete code. My worry is that the similarity metric is too simplistic and may flag semantically different, or even opposing, arguments as identical. In any case, I'd suggest creating a refined version of the original dataset rather than changing this subset. :-)
