Low Latency CPU Based Educational Value Classifier With Generic Educational Value

Community article published June 12, 2024

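1. Motivation

There is an emerging trend of language models such as Phi-3[1], Llama3[2], and Mistral-7B[3] becoming smaller while getting more capable. In particular, the Phi-3 technical report[1] introduces a "data optimal regime" that focuses on data quality, in contrast to the "compute optimal regime" that focuses on the optimal model size and number of tokens for training.

We were inspired by Textbooks Are All You Need[4], in which a classifier was developed to predict the educational value of a code dataset and then used for data filtering, which brought a significant boost in model performance. Our motivation is to build a lightweight classifier that can predict the educational value of any document from the web.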

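Our contributions include:

⚡Releasing a low latency CPU based educational value classifier (the "Classifier") for filtering pretraining datasets to obtain better LLM performance with the same number of training tokens, built on a loosely defined, generic notion of educational value;
⚡Releasing fineweb-edu-fasttext-classifier, trained on the dataset HuggingFaceFW/fineweb-edu-llama3-annotations, which has an explicitly defined educational value;
📊A detailed analysis of how the two classifiers differ in educational value annotation because of differences in their prompts;
🔎Exploring the use of the educational value classifier to evaluate pretraining datasets before pretraining and to mine domains with high educational value on the Internet, both made possible by its low latency.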

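2. Dataset Construction

To build a dataset for training the classifier, we had to ensure its diversity within a limited compute budget. Phi-3-mini-128k-instruct[1] was chosen as the annotation model because it is compute efficient and shows strong reasoning and language understanding for its small size. MiniPile[5] was used as the training and test dataset because it was built through clustering and human-guided exclusion. Even though models are trained on only 1 million documents, the performance drop on GLUE is marginal.

Prompt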

Task: Classify if the provided context has High or Low educational value for a student. Label is either High or Low.


Context: {text}
Label:<|end|>
<|assistant|>

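We did not define educational value explicitly, because it involves subjectivity and because we were unsure of the capability of a small language model.

The continuation logits of the tokens "High" and "Low" are used to construct a binary classification problem: P(High educational value) = Logit("High") / (Logit("High") + Logit("Low"))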
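The formula above is written in terms of raw logits, and a ratio of raw logits is not guaranteed to lie in [0, 1]. The sketch below therefore assumes the two logits are exponentiated first (a two-way softmax). That normalization is an assumption made here for illustration, not something the text confirms, and `p_high_from_logits` is a hypothetical helper name.

```python
import math


def p_high_from_logits(logit_high: float, logit_low: float) -> float:
    """Two-way softmax over the continuation logits of "High" and "Low".

    Assumption: the logits are exponentiated before normalizing.
    """
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(logit_high, logit_low)
    e_high = math.exp(logit_high - m)
    e_low = math.exp(logit_low - m)
    return e_high / (e_high + e_low)


# Equal logits give no preference between the two labels.
print(p_high_from_logits(0.0, 0.0))
```

In practice the two logits would be read from the annotation model's next-token distribution at the position after "Label:".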

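Afterwards, the probabilities are used to create 3 labels, which gives finer granularity of educational value:

High (top 25% educational value)
Mid (middle 25-75% educational value)
Low (bottom 25% educational value)

During inference, the educational value is calculated as: Educational value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low)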
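The two steps above can be sketched as follows: rank-based bucketing of P(High) into three labels, then the expected-value score at inference time. The function names and the handling of boundary ranks are assumptions for illustration. The text gives only the 25% / 50% / 25% split.

```python
def label_by_quantile(probs):
    """Assign "Low" / "Mid" / "High" from each document's rank by P(High)."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])
    labels = [None] * n
    for rank, i in enumerate(order):
        frac = rank / n  # position of this document in ascending order
        if frac < 0.25:
            labels[i] = "Low"
        elif frac < 0.75:
            labels[i] = "Mid"
        else:
            labels[i] = "High"
    return labels


def educational_value(p_high, p_mid, p_low):
    """Expected score on a 0-2 scale from the three class probabilities."""
    return 2 * p_high + 1 * p_mid + 0 * p_low


print(label_by_quantile([0.1, 0.9, 0.5, 0.3, 0.7, 0.2, 0.8, 0.6]))
print(educational_value(0.5, 0.3, 0.2))
```

Because the score is an expectation over three classes, it ranges from 0 (all mass on Low) to 2 (all mass on High), which matches the range of the outputs shown in section 4.2.1.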

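3. Model Training

fastText[6] was chosen as the modeling approach, in which word representations are averaged and fed into a linear layer for classification, because it is fast enough to process pretraining data with billions or even trillions of tokens.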
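To make "averaged word representations fed into a linear layer" concrete, here is a minimal pure-Python sketch of that forward pass. It is not the fastText library and omits its n-gram features and hashing. The vocabulary, vectors, and weights are toy values invented for illustration.

```python
import math


def fasttext_style_forward(tokens, embeddings, weights, biases):
    """Average the known word vectors, apply a linear layer, then softmax."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        avg = [sum(v[d] for v in known) / len(known) for d in range(dim)]
    else:
        avg = [0.0] * dim  # no known words: fall back to the zero vector
    logits = [sum(w * x for w, x in zip(row, avg)) + b
              for row, b in zip(weights, biases)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters (hypothetical): 2-dimensional vectors, 2 output classes.
toy_embeddings = {"proof": [1.0, 0.0], "goal": [0.0, 1.0]}
toy_weights = [[2.0, 0.0], [0.0, 2.0]]  # one row per class
toy_biases = [0.0, 0.0]

print(fasttext_style_forward(["proof", "theorem"], toy_embeddings, toy_weights, toy_biases))
```

The cost per document is one pass over its tokens plus a small matrix-vector product, which is why this family of models runs quickly on CPU.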

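4. Evaluation

4.1 Classifier Evaluation

Because the classifier is used to rank text data rather than to classify it, the Spearman rank correlation coefficient was measured. The coefficient between the predicted educational value and the test data is 0.7055, indicating a strong monotonic relationship.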
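For readers unfamiliar with the metric, Spearman's coefficient is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below is a standard-library implementation for illustration. The helper names are hypothetical and this is not necessarily the code used to obtain the figure above.

```python
def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relationship still gives a perfect rank correlation.
print(spearman([1, 2, 3, 4], [1, 8, 27, 64]))
```

Only the ordering of the scores matters for this metric, which suits a classifier whose output is used to rank documents for filtering.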

4.2 Analysis

4.2.1 Manual Inspection

predict_educational_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
# Output [1.9266871362924576]
predict_educational_value(['''"Attention Is All You Need" is a landmark[1][2] 2017 research paper authored by eight scientists working at Google, responsible for expanding 2014 attention mechanisms proposed by Bahdanau et al. into a new deep learning architecture known as the transformer. The paper is considered by some to be a founding document for modern artificial intelligence, as transformers became the main architecture of large language models.[3][4] At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but even in their paper the authors saw the potential for other tasks like question answering and for what is now called multimodal Generative AI.[5]'''])
# Output [1.8226698189973831]
predict_educational_value(['''A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[1] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[2]'''])
# Output [1.7609568238258362]
predict_educational_value(['''In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the algorithm can shatter, which means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of those data points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.[1]'''])
# Output [1.589950144290924]
predict_educational_value(['''The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.[5] The structure of the input data is captured in the Wq and Wk weights, and the Wv weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Wq), Key (Wk), and Value (Wv)—a loose and possibly misleading analogy with relational database systems.'''])
# Output [1.4657384157180786]
predict_educational_value(['''The Arsenal Football Club (commonly known as simply Arsenal) is an English professional football club based in Holloway, North London. Arsenal compete in the Premier League, the top flight of English football. In domestic football, Arsenal has won 13 league titles (including one unbeaten title), a record 14 FA Cups, two League Cups, 17 FA Community Shields, and a Football League Centenary Trophy. In European football, they have one European Cup Winners' Cup and one Inter-Cities Fairs Cup. In terms of trophies won, it is the third-most successful club in English football.[2]'''])
# Output [1.1015518307685852]
predict_educational_value(['''The 2003–04 season was Arsenal Football Club's 12th season in the Premier League and their 78th consecutive season in the top flight of English football.[3][4] It began on 1 July 2003 and concluded on 30 June 2004, with competitive matches played between August and May. The club ended the Premier League campaign as champions without a single defeat – a record of 26 wins and 12 draws. Arsenal fared less well in the cups, eliminated in the FA Cup and League Cup semi-finals to Manchester United and Middlesbrough respectively, and at the quarter-final stage of the UEFA Champions League to Chelsea.'''])
# Output [1.0146622359752655]
predict_educational_value(['''As both teams' first-choice kits featured a shade of red, Arsenal wore their yellow away strip, while Barcelona wore their traditional blue and maroon striped kit. Arsenal won the coin toss and Barcelona kicked off.[21] Barcelona almost immediately came under pressure when Thierry Henry shot straight at Barcelona goalkeeper Víctor Valdés, who conceded a corner. From the resulting corner Arsenal had another chance again courtesy of Henry, whose shot was again saved by Valdés. The next attack in the seventh minute resulted in Arsenal goalkeeper Jens Lehmann saving from Ludovic Giuly after he shot from a narrow angle. Four minutes later Barcelona were awarded a free-kick 35 yards from goal; Ronaldinho shot wide of the goal.'''])
# Output [0.7897453680634499]

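From manual inspection, it can be seen that the model favors scientific knowledge. It is also interested in Arsenal Football Club; however, it does not consider a summary of a specific match to have good educational value. A document coming from Wikipedia does not by itself have high educational value.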

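4.2.3 Model Training With and Without the Classifier

To validate that filtering data with the classifier leads to better performance with the same number of training tokens, we trained two 192M models, each for 6000 global steps.

Task	Trained on FineWeb with filtering	Trained on FineWeb without filtering	Trained on Cosmopedia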
arc-easy	37.37	34.97	37.45
arc-challenge	23.55	22.95	23.21
Hellaswag	28.02	27.92	27.78
MMLU	24.71	23.94	24.65
TruthfulQA	45.88	45.20	45.97
Winogrande	49.49	50.59	50.67

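Reasoning and commonsense reasoning appear to improve with the filter on, as expected: the filtered run scores higher than the unfiltered run on every task except Winogrande (49.49 vs 50.59). It also comes close to the performance of training on Cosmopedia.
MMLU is better as well; however, it is close to random because of compute constraints (training time and model size).
Larger models will be trained to further validate this claim.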

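4.2.4 Website Domain Analysis

We expect most educational value to come from the websites of universities/schools, research institutes, and organizations.
Since HuggingFaceFW/fineweb contains the URLs of the crawled websites, the average educational value of each domain can be computed.
The first 10M records have been analyzed. The full file is available here.

Below are the top 100 domains with >= 100 records.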
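The per-domain aggregation described in this section can be sketched as follows, assuming each record is a `(url, educational_value)` pair. The function name, the record shape, and the use of `urlparse(...).netloc` as the domain key are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def mean_value_by_domain(records, min_count=100):
    """Average educational value per domain, keeping domains with enough records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for url, value in records:
        domain = urlparse(url).netloc.lower()
        if not domain:
            continue  # skip malformed URLs
        sums[domain] += value
        counts[domain] += 1
    means = {d: sums[d] / counts[d] for d in counts if counts[d] >= min_count}
    # Highest average educational value first.
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Tiny illustrative input with placeholder domains; min_count lowered to fit it.
sample = [
    ("https://a.example/x", 2.0),
    ("https://a.example/y", 1.0),
    ("https://b.example/z", 0.5),
]
print(mean_value_by_domain(sample, min_count=2))
```

The `min_count` threshold plays the role of the ">= 100 records" cutoff above: it keeps a handful of unusually high-scoring pages from pushing a rarely crawled domain to the top.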

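4.2.5 Existing Pretraining Datasets

The classifier was applied to a variety of datasets, and the results are as expected.

In general, synthetic data has higher educational value because it is created with high educational value by design.
For real data, HuggingFaceFW/fineweb and Dolma v1_7, which applied the quality filter described here, have the highest educational value among all real data.
In general, the later a dataset is released, the higher its educational value, because of the research community's increasing focus on data quality.
The textbook category (mostly synthetic) scores the highest because it is created for educational value, which reflects the effectiveness of the model.
The maths/paper category scores the second highest because of its density of knowledge.
Wikipedia scores comparatively lower because it also contains information with less educational value (e.g. match results, awards of movie stars).
Web scores low (if no filtering is applied) because it contains all domains.
Memes score the lowest, as expected. Hateful memes score almost zero.

In fact, it is not surprising that pretraining data with higher educational value leads to better LLM performance on benchmarks. Therefore, with a reasonable number of experiment runs, researchers and practitioners could even predict benchmark performance before training by regressing performance on educational value.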
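As a rough illustration of the regression idea, the sketch below fits an ordinary least-squares line of benchmark score against average educational value. The helper names are hypothetical, and the three points are synthetic placeholders lying exactly on a line, not measurements from any experiment. Whether the real relationship is linear is an open question.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    return slope * x + intercept


# Synthetic placeholder points on y = 2x + 1 (NOT real benchmark results).
edu_values = [0.0, 1.0, 2.0]
scores = [1.0, 3.0, 5.0]
slope, intercept = fit_line(edu_values, scores)
print(slope, intercept, predict(slope, intercept, 1.5))
```

With real runs, each point would pair a dataset's average educational value with the benchmark score of a model trained on it under a fixed token budget.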

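There are two compute bottlenecks: model training compute and educational value inference compute. The second bottleneck is removed by the proposed classifier, which can run inference over massive data at a throughput of more than 2000 documents per second.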
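To put the throughput figure in perspective, a back-of-the-envelope estimate: at exactly 2000 documents per second, the 10M records analyzed in section 4.2.4 would take 5000 seconds. The helper below is a hypothetical convenience function, and since the stated throughput is a lower bound, the result is an upper bound on the time.

```python
def inference_hours(num_documents, docs_per_second=2000.0):
    """Wall-clock hours needed to score num_documents at a fixed throughput."""
    return num_documents / docs_per_second / 3600.0


print(round(inference_hours(10_000_000), 2))  # 1.39 hours for 10M documents
```

The estimate assumes a single worker at constant throughput; sharding the data across CPU cores or machines would reduce wall-clock time roughly in proportion.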

Dataset	Sampling	Average Educational Value	Type
SciPhi/textbooks-are-all-you-need-lite	First 100,000	1.846	Synthetic
nampdn-ai/tiny-orca-textbooks	First 100,000	1.673	Synthetic
HuggingFaceTB/cosmopedia stanford	First 100,000	1.673	Synthetic
vikp/textbook_quality_programming	First 100,000	1.663	Synthetic
HuggingFaceTB/cosmopedia web_samples_v1	First 100,000	1.618	Synthetic
nampdn-ai/tiny-textbooks	First 100,000	1.586	Synthetic
HuggingFaceTB/cosmopedia web_samples_v2	First 100,000	1.562	Synthetic
HuggingFaceTB/cosmopedia openstax	First 100,000	1.462	Synthetic
HuggingFaceTB/cosmopedia wikihow	First 100,000	1.422	Synthetic
HuggingFaceTB/cosmopedia khanacademy	First 100,000	1.419	Synthetic
HuggingFaceTB/cosmopedia auto_math_text	First 100,000	1.347	Synthetic
armanc/scientific_papers pubmed	First 100,000	1.260	Real
HuggingFaceTB/cosmopedia stories	First 100,000	1.154	Synthetic
teknium/OpenHermes-2.5	First 100,000	1.121	Synthetic
timdettmers/openassistant-guanaco	First 100,000	1.115	Real
open-web-math/open-web-math	First 100,000	1.089	Real
armanc/scientific_papers arxiv	First 100,000	1.068	Real
HuggingFaceFW/fineweb	First 100,000	1.056	Real
NousResearch/dolma-v1_7-305B*	First 100,000	1.037	Real
tatsu-lab/alpaca	First 100,000	1.020	Synthetic
BEE-spoke-data/fineweb-100k_en-med	First 100,000	1.019	Real
JeanKaddour/minipile	First 100,000	0.998	Real
togethercomputer/RedPajama-Data-V2 en 2023-06	First 100,000	0.985	Real
wikipedia en 20220301	First 100,000	0.975	Real
Replete-AI/code_bagel	First 100,000	0.950	Synthetic
allenai/c4 en	First 100,000	0.934	Real
mattymchen/refinedweb-3m	First 100,000	0.857	Real
iamtarun/python_code_instructions_18k_alpaca	First 100,000	0.849	Synthetic
tiiuae/falcon-refinedweb	First 100,000	0.835	Real
BEE-spoke-data/FineMeme-100k	First 100,000	0.716	Real
neuralcatcher/hateful_memes	First 100,000	0.070	Real
* We encountered an issue, so we could not process the original allenai/dolma.

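4.2.6. Benchmarking Against HuggingFaceTB/fineweb-edu-classifier

Our work was carried out independently of fineweb-edu-classifier, as our model was released in mid-May 2024. We are glad to see that HuggingFace's FineWeb-Edu validates our original research goal, namely that training with an educational value classifier yields better LLM performance with the same number of training tokens, and that it shows a larger performance gain than we could achieve within our budget constraints.

Although both aim to classify the educational value of a document, the differences are worth noting, as shown in the table below:

	HuggingFaceTB/fineweb-edu-classifier (fineweb-edu-classifier)	kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2 (this classifier)
Training dataset	Samples from FineWeb (450k)	MiniPile (1M)
Label granularity in LLM annotation	6 classes (explicitly defined)	2 classes (generic)
Label construction	LLM annotation	LLM continuation logits
Annotation model	Llama3-70B-instruct	Phi-3-mini-128k-instruct
Modeling approach	Transformer-based model with a classification head	fastText text classification

fineweb-edu-classifier was validated by training larger language models and evaluating them on different benchmarks, so it is helpful to validate our model against it.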

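4.2.6.1 MiniPile Test Dataset

On the MiniPile test set, the Spearman correlation between this classifier and fineweb-edu-classifier is 0.4108. When we compute the average educational score predicted by this classifier, grouped by the score predicted by fineweb-edu-classifier, we can see that this classifier distinguishes class < 2 from class >= 2 well, but does not distinguish class 2, class 3, and class 4 well.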
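The grouped average described above can be sketched in a few lines, assuming two parallel lists: the integer class predicted by fineweb-edu-classifier and the educational value predicted by this classifier for the same documents. The function name and the input shape are assumptions for illustration.

```python
from collections import defaultdict


def mean_by_class(classes, values):
    """Mean of `values` within each group defined by `classes`."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, v in zip(classes, values):
        sums[c] += v
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sorted(counts)}


# Tiny illustrative input, not taken from the benchmark dataset.
print(mean_by_class([0, 0, 2], [0.5, 1.5, 1.8]))
```

If the group means rise sharply between class 1 and class 2 and then flatten, that corresponds to the pattern reported above.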

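To further validate this claim, we formulate it as a binary classification problem, where

the label is 0 if fineweb-edu-classifier predicts [0, 1];
the label is 1 otherwise.

The macro-averaged F1 score is 0.67.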
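For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores. The sketch below shows the binary relabeling above together with the metric. How this classifier's continuous score was turned into a 0/1 prediction is not stated in the text, so the predictions are passed in as given, and the function names are hypothetical.

```python
def binarize_fineweb_class(predicted_class):
    """Label 0 if fineweb-edu-classifier predicts class 0 or 1, else label 1."""
    return 0 if predicted_class <= 1 else 1


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Tiny illustrative input, not taken from the benchmark dataset.
y_true = [binarize_fineweb_class(c) for c in [0, 1, 2, 4]]
y_pred = [0, 1, 1, 1]
print(macro_f1(y_true, y_pred))
```

Macro averaging weights both classes equally, so a classifier cannot score well by favoring whichever class is more common.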

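We release a benchmark dataset containing both models' predictions for each document in MiniPile, so that interested readers can compare where the results agree and differ.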

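4.2.6.2 FineWeb-Edu Dataset

We further applied the classifier to the first 100,000 records of FineWeb-Edu. The average educational value is 1.37, which would make FineWeb-Edu the highest scoring real dataset in section 4.2.5.

The distribution of educational value is skewed toward the high end, with 87.18% of records having an educational value >= 1.0, which means we would retain 87.18% of the data if our classifier were applied with a threshold of 1.0.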
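The retention figure corresponds to a simple threshold filter. A minimal sketch, assuming a list of educational values already predicted by the classifier (the function name is hypothetical; the threshold of 1.0 comes from the text):

```python
def retained_fraction(values, threshold=1.0):
    """Fraction of documents kept when filtering at educational value >= threshold."""
    kept = sum(1 for v in values if v >= threshold)
    return kept / len(values)


# Tiny illustrative input, not taken from FineWeb-Edu.
print(retained_fraction([0.5, 1.0, 1.5, 2.0]))  # 0.75
```

Raising the threshold trades data volume for a higher average educational value, so the choice depends on the token budget available.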

4.2.6.3 fastText vs. Transformer-Based Model

To understand how much the methodological difference contributes to the difference in predictions, another fastText classifier ("fineweb-edu-fasttext-classifier") was trained on HuggingFaceFW/fineweb-edu-llama3-annotations.

Label	kenhktsui/fineweb-edu-fasttext-classifier	HuggingFaceFW/fineweb-edu-classifier
0	0.55	0.59
1	0.80	0.81
2	0.50	0.59
3	0.39	0.53
4	0.06	0.44
5	0.00	0.02

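Labels 0, 1, and 2 are comparable to the original model. The performance degradation becomes noticeable at label 3 and widens further at label 4, which is due to the limited capacity of the fastText model. This is consistent with the observation in section 4.2.6.1.

Model	Spearman correlation on the MiniPile test set
fineweb-edu-fasttext-classifier	0.5832
llm-data-textbook-quality-fasttext-classifier-v2	0.4108

The Spearman correlation between fineweb-edu-fasttext-classifier and HuggingFaceFW/fineweb-edu-classifier on the MiniPile test set is 0.5832, which is still not high even though both use the same training data. The main reason is that the fastText model, with its limited capacity, cannot capture the highest educational values well. The remaining difference can be attributed to the training dataset, label construction, and annotation model described in section 4.2.6.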

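4.2.6.4 Prompt Differences in Educational Value Annotation

There are 1,778 records for which our classifier predicts an educational value >= 1 while fineweb-edu-classifier predicts [0, 1]. To isolate the difference in annotation models, we prompted Phi-3-mini-4k-instruct with both our prompt and the fineweb-edu-classifier prompt.

Among the records from which a score could be extracted, 45% keep the same rating as Llama-3-70B-Instruct, and 33% (13%) give a rating 1 point higher (1 point lower), which reflects the annotation model difference between Phi-3-mini-4k-instruct and Llama-3-70B-Instruct.

The remaining difference can be attributed to the definition of educational value. On inspection, these are the reasons fineweb-edu-classifier predicts lower scores, which is consistent with the specificity of its prompt:

The complexity is not suitable for grade school or kindergarten students
The content does not align with educational standards or does not provide substantial learning material suitable for grade school or kindergarten level

Our classifier uses a more generic and implicit prompt without explicit instructions for annotating educational value; it is not limited to grade school or kindergarten students, nor does it enforce alignment with educational standards.

Different definitions of educational value (loosely defined vs. explicitly defined) explain most of the difference. Which classifier is better may not be universal, because it depends on the use case. In some cases, the best approach may lie in a combination of both, or more.

For the full dataset, see kenhktsui/edu-value-annotation-difference-hf-edu-score-le2-tbq-v2-score-ge1.

4.2.7 Limitations of the Classifier

The classifier is known to be unable to detect hallucination, and it does not perform well on non-web data because it was not trained on such data.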

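5. Discussion and Future Work

In the past, the mainstream practice was to scale up language models and then scale up data to achieve SOTA results. It is very encouraging to now see more and more effort going into data quality rather than only into scaling up model parameters.

The low latency classifier and fineweb-edu-fasttext-classifier offer a promising way to 1) filter datasets in a cheap and scalable manner, and 2) evaluate pretraining datasets at scale before pretraining, which will help researchers and practitioners with fewer compute resources train large/small language models more efficiently.

We look forward to the research community putting more effort into data quality, and several directions are worth exploring.

Definition of educational value: as shown in section 4.2.6.4, educational value is highly subjective because it varies from person to person. For example, to an accountant, machine learning knowledge may not have as much educational value as International Financial Reporting Standards. Our attempt tries to be as implicit as possible, so as to capture the "average" educational value for a student. In fact, it should be highly personalized.

Scaling law of educational value: as more experiments become available with educational value known before pretraining, a meta-analysis could serve as a proxy for predicting LLM performance before training. Educational value facilitates not only the focus on and standardization of data quality, but also the personalization of LLMs.

Active crawling and data licensing: rather than passively relying on snapshots of Common Crawl (which is only a subset of web data, with URL percentages unrelated to educational value), one could actively crawl and license data after identifying domains with high educational value. Section 4.2.4 provides a starting point.

Multilingual and multimodal: there is no reason not to extend the finding that training data with higher educational value leads to better model performance to other languages and modalities.

Limits of small and large language models: given a perfect educational dataset, how far can a small language model go? Given a perfect educational dataset, how far can a large language model go?

6. References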

[1] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
[2] https://github.com/meta-llama/llama3
[3] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[4] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
[5] Jean Kaddour. The minipile challenge for data-efficient language models. arXiv preprint arXiv:2304.08442, 2023.
[6] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.

Citation

To cite this blog, please use

@misc{ktsui2024cpueduvalue,
      title={Low Latency CPU Based Educational Value Classifier With Generic Educational Value}, 
      author={Ken Tsui and Huu Nguyen},
      year={2024},
}
