Kurtis-E1.1：使用 Flower.ai 和 Hugging Face 对 Qwen2.5-3B-Instruct 进行监督微调

社区文章发布日期：2025年4月2日

Massimo Roberto Scamarcia

mrs83

很高兴分享 Kurtis-E1.1-Qwen2.5-3B-Instruct 的评估结果，它是 Qwen2.5-3B-Instruct 模型的微调变体。

我的目标是通过有针对性的微调来探索 SLM（小型语言模型）的能力，并为社区提供代码和基准数据。

什么是 Kurtis？

Kurtis 是我正在进行的努力的一部分，旨在探索小型、经过微调的语言模型如何在不依赖庞大基础设施的情况下，提供有意义、上下文感知和情商化的响应。

名称 ethicalabs.ai 不是一家公司，而是我选择发布开源作品的标签和非营利倡议。

它反映了我个人致力于开发注重透明度、可持续性和以人为本设计的软件的承诺。

该项目的一个核心主题也是探索人工智能/机器学习在辅助技术方面的应用，特别是针对边缘化社区、残疾人和神经多样性个体。

通过 Kurtis-E1.1-Qwen2.5-3B-Instruct，我使用 FlowerTune LLM 应用了联邦微调技术，以适应 3B 参数的基础模型。

选择 FlowerTune LLM 是因为它能够模拟去中心化、保护隐私的联邦 LLM 训练，非常适合重视社区协作、隐私和可持续性的项目。

目标不是与最大的模型竞争，而是展示深思熟虑的微调如何能让较小的系统仍能产生有用、可解释和与人类对齐的工具。

本报告分享了这一旅程的最新成果，我希望它能为道德、包容和高效的人工智能开发的更广泛讨论做出贡献。

为什么？

自闭症不是一种需要治愈的疾病。它是一种人类状况，一种谱系。真正需要“修复”的是社会拒绝接受多样性的态度。

太多神经多样性人群甚至不知道自己是自闭症患者：几十年来，我们只谈论缺陷，而不是差异。

这种沉默导致误诊、有害治疗和深深的痛苦，尤其是在污名阻碍公开谈论自闭症谱系的家庭中。

神经多样性可以被掩盖。但通常，它会变得可见，尤其是在缺乏支持的情况下。

我们中的一些人，在更优越的环境中成长，可以按照自己的节奏生活，而另一些人，被僵化的期望所压垮，他们的自闭症表现为倦怠、成瘾或危机。

这就是我从事这个项目的原因：Kurtis E1.1 并非追求基准至上或规模之争，而是专注于理解、可持续性和实际影响，尤其是在心理健康支持和更安全的人工智能交互等领域。

一款在您的设备上离线运行的隐私感知型本地助手。即使没有人可用，它也始终存在。

微调方法

该模型是使用来自 flower.ai 的 FlowerTune LLM 进行微调的，这是一个专为联邦微调设计的平台。
我利用了 FlowerTune LLM 的功能，对 Qwen2.5-3B-Instruct 基础模型进行特定任务的调整，同时保持了基础模型的性能。

数据集

该数据集融合了多样化的心理健康、哲学和通用问答数据集，以创建能够进行有意义对话和逻辑推理的全面模型。

https://huggingface.co/datasets/ethicalabs/Kurtis-E1-SFT

评估结果

评估任务在 NVIDIA A40 上使用 LM Evaluation Harness 进行。

hellaswag

lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct  --tasks hellaswag --device cuda:0 --batch_size 8

任务	版本	过滤器	n-shot	指标		值		标准误差
hellaswag	1	无	0	准确率	↑	0.5555	±	0.0050
		无	0	acc_norm	↑	0.7412	±	0.0044

arc_easy

lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct  --tasks arc_easy --device cuda:0 --batch_size 8

任务	版本	过滤器	n-shot	指标		值		标准误差
arc_easy	1	无	0	准确率	↑	0.7710	±	0.0086
		无	0	acc_norm	↑	0.6789	±	0.0096

arc_challenge

lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct  --tasks arc_challenge --device cuda:0 --batch_size 8

任务	版本	过滤器	n-shot	指标		值		标准误差
arc_challenge	1	无	0	准确率	↑	0.436	±	0.0145
		无	0	acc_norm	↑	0.448	±	0.0145

mmlu

lm_eval --model hf --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct  --tasks mmlu --device cuda:0 --batch_size 8

任务	版本	过滤器	n-shot	指标		值		标准误差
mmlu	2	无		准确率	↑	0.6522	±	0.0038
- 人文学科	2	无		准确率	↑	0.5734	±	0.0066
- 形式逻辑	1	无	0	准确率	↑	0.4603	±	0.0446
- 高中欧洲历史	1	无	0	准确率	↑	0.7939	±	0.0316
- 高中美国历史	1	无	0	准确率	↑	0.8333	±	0.0262
- 高中世界历史	1	无	0	准确率	↑	0.8397	±	0.0239
- 国际法	1	无	0	准确率	↑	0.7769	±	0.0380
- 法理学	1	无	0	准确率	↑	0.7963	±	0.0389
- 逻辑谬误	1	无	0	准确率	↑	0.7975	±	0.0316
- 道德争议	1	无	0	准确率	↑	0.6850	±	0.0250
- 道德情景	1	无	0	准确率	↑	0.2905	±	0.0152
- 哲学	1	无	0	准确率	↑	0.7106	±	0.0258
- 史前史	1	无	0	准确率	↑	0.7438	±	0.0243
- 专业法律	1	无	0	准确率	↑	0.4759	±	0.0128
- 世界宗教	1	无	0	准确率	↑	0.8246	±	0.0292
- 其他	2	无		准确率	↑	0.7087	±	0.0079
- 商业伦理	1	无	0	准确率	↑	0.7300	±	0.0446
- 临床知识	1	无	0	准确率	↑	0.7321	±	0.0273
- 大学医学	1	无	0	准确率	↑	0.6705	±	0.0358
- 全球事实	1	无	0	准确率	↑	0.3900	±	0.0490
- 人类衰老	1	无	0	准确率	↑	0.7130	±	0.0304
- 管理学	1	无	0	准确率	↑	0.7961	±	0.0399
- 市场营销	1	无	0	准确率	↑	0.8803	±	0.0213
- 临床遗传学	1	无	0	准确率	↑	0.7600	±	0.0429
- 杂项	1	无	0	准确率	↑	0.7957	±	0.0144
- 营养学	1	无	0	准确率	↑	0.7353	±	0.0253
- 专业会计	1	无	0	准确率	↑	0.5426	±	0.0297
- 专业医学	1	无	0	准确率	↑	0.6434	±	0.0291
- 病毒学	1	无	0	准确率	↑	0.4880	±	0.0389
- 社会科学	2	无		准确率	↑	0.7618	±	0.0076
- 计量经济学	1	无	0	准确率	↑	0.5439	±	0.0469
- 高中地理	1	无	0	准确率	↑	0.7677	±	0.0301
- 高中政治与政府	1	无	0	准确率	↑	0.8860	±	0.0229
- 高中宏观经济学	1	无	0	准确率	↑	0.6949	±	0.0233
- 高中微观经济学	1	无	0	准确率	↑	0.7773	±	0.0270
- 高中心理学	1	无	0	准确率	↑	0.8477	±	0.0154
- 人类性学	1	无	0	准确率	↑	0.7786	±	0.0364
- 专业心理学	1	无	0	准确率	↑	0.7075	±	0.0184
- 公共关系	1	无	0	准确率	↑	0.6818	±	0.0446
- 安全研究	1	无	0	准确率	↑	0.7224	±	0.0287
- 社会学	1	无	0	准确率	↑	0.8458	±	0.0255
- 美国外交政策	1	无	0	准确率	↑	0.8400	±	0.0368
- 科学、技术、工程、数学 (STEM)	2	无		准确率	↑	0.6070	±	0.0085
- 抽象代数	1	无	0	准确率	↑	0.4700	±	0.0502
- 解剖学	1	无	0	准确率	↑	0.6667	±	0.0407
- 天文学	1	无	0	准确率	↑	0.6776	±	0.0380
- 大学生物学	1	无	0	准确率	↑	0.7222	±	0.0375
- 大学化学	1	无	0	准确率	↑	0.5000	±	0.0503
- 大学计算机科学	1	无	0	准确率	↑	0.6000	±	0.0492
- 大学数学	1	无	0	准确率	↑	0.3400	±	0.0476
- 大学物理	1	无	0	准确率	↑	0.4902	±	0.0497
- 计算机安全	1	无	0	准确率	↑	0.7000	±	0.0461
- 概念物理	1	无	0	准确率	↑	0.6468	±	0.0312
- 电气工程	1	无	0	准确率	↑	0.6690	±	0.0392
- 初等数学	1	无	0	准确率	↑	0.5979	±	0.0253
- 高中生物学	1	无	0	准确率	↑	0.8129	±	0.0222
- 高中化学	1	无	0	准确率	↑	0.5813	±	0.0347
- 高中计算机科学	1	无	0	准确率	↑	0.7800	±	0.0416
- 高中数学	1	无	0	准确率	↑	0.5037	±	0.0305
- 高中物理	1	无	0	准确率	↑	0.4437	±	0.0406
- 高中统计学	1	无	0	准确率	↑	0.5972	±	0.0334
- 机器学习	1	无	0	准确率	↑	0.4554	±	0.0473

分组	版本	过滤器	指标		值		标准误差
mmlu	2	无	准确率	↑	0.6522	±	0.0038
- 人文学科	2	无	准确率	↑	0.5734	±	0.0066
- 其他	2	无	准确率	↑	0.7087	±	0.0079
- 社会科学	2	无	准确率	↑	0.7618	±	0.0076
- 科学、技术、工程、数学 (STEM)	2	无	准确率	↑	0.6070	±	0.0085

mmlu (5-shot)

lm_eval   --model hf   --model_args pretrained=ethicalabs/Kurtis-E1.1-Qwen2.5-3B-Instruct   --tasks mmlu   --device cuda:0   --batch_size 8 --num_fewshot 5

任务	版本	过滤器	n-shot	指标		值		标准误差
mmlu	2	无		准确率	↑	0.6629	±	0.0038
- 人文学科	2	无		准确率	↑	0.5862	±	0.0067
- 形式逻辑	1	无	5	准确率	↑	0.4683	±	0.0446
- 高中欧洲历史	1	无	5	准确率	↑	0.7818	±	0.0323
- 高中美国历史	1	无	5	准确率	↑	0.8284	±	0.0265
- 高中世界历史	1	无	5	准确率	↑	0.8692	±	0.0219
- 国际法	1	无	5	准确率	↑	0.7769	±	0.0380
- 法理学	1	无	5	准确率	↑	0.7963	±	0.0389
- 逻辑谬误	1	无	5	准确率	↑	0.8098	±	0.0308
- 道德争议	1	无	5	准确率	↑	0.7110	±	0.0244
- 道德情景	1	无	5	准确率	↑	0.3464	±	0.0159
- 哲学	1	无	5	准确率	↑	0.7042	±	0.0259
- 史前史	1	无	5	准确率	↑	0.7284	±	0.0247
- 专业法律	1	无	5	准确率	↑	0.4759	±	0.0128
- 世界宗教	1	无	5	准确率	↑	0.8304	±	0.0288
- 其他	2	无		准确率	↑	0.7171	±	0.0078
- 商业伦理	1	无	5	准确率	↑	0.7400	±	0.0441
- 临床知识	1	无	5	准确率	↑	0.7321	±	0.0273
- 大学医学	1	无	5	准确率	↑	0.6647	±	0.0360
- 全球事实	1	无	5	准确率	↑	0.4100	±	0.0494
- 人类衰老	1	无	5	准确率	↑	0.7220	±	0.0301
- 管理学	1	无	5	准确率	↑	0.7864	±	0.0406
- 市场营销	1	无	5	准确率	↑	0.8889	±	0.0206
- 临床遗传学	1	无	5	准确率	↑	0.7900	±	0.0409
- 杂项	1	无	5	准确率	↑	0.7957	±	0.0144
- 营养学	1	无	5	准确率	↑	0.7680	±	0.0242
- 专业会计	1	无	5	准确率	↑	0.5532	±	0.0297
- 专业医学	1	无	5	准确率	↑	0.6471	±	0.0290
- 病毒学	1	无	5	准确率	↑	0.5120	±	0.0389
- 社会科学	2	无		准确率	↑	0.7735	±	0.0075
- 计量经济学	1	无	5	准确率	↑	0.5877	±	0.0463
- 高中地理	1	无	5	准确率	↑	0.7828	±	0.0294
- 高中政治与政府	1	无	5	准确率	↑	0.8756	±	0.0238
- 高中宏观经济学	1	无	5	准确率	↑	0.7051	±	0.0231
- 高中微观经济学	1	无	5	准确率	↑	0.7773	±	0.0270
- 高中心理学	1	无	5	准确率	↑	0.8550	±	0.0151
- 人类性学	1	无	5	准确率	↑	0.8092	±	0.0345
- 专业心理学	1	无	5	准确率	↑	0.7288	±	0.0180
- 公共关系	1	无	5	准确率	↑	0.6909	±	0.0443
- 安全研究	1	无	5	准确率	↑	0.7551	±	0.0275
- 社会学	1	无	5	准确率	↑	0.8308	±	0.0265
- 美国外交政策	1	无	5	准确率	↑	0.8300	±	0.0378
- 科学、技术、工程、数学 (STEM)	2	无		准确率	↑	0.6159	±	0.0084
- 抽象代数	1	无	5	准确率	↑	0.5000	±	0.0503
- 解剖学	1	无	5	准确率	↑	0.6222	±	0.0419
- 天文学	1	无	5	准确率	↑	0.7500	±	0.0352
- 大学生物学	1	无	5	准确率	↑	0.7083	±	0.0380
- 大学化学	1	无	5	准确率	↑	0.4700	±	0.0502
- 大学计算机科学	1	无	5	准确率	↑	0.6200	±	0.0488
- 大学数学	1	无	5	准确率	↑	0.4000	±	0.0492
- 大学物理	1	无	5	准确率	↑	0.4902	±	0.0497
- 计算机安全	1	无	5	准确率	↑	0.8200	±	0.0386
- 概念物理	1	无	5	准确率	↑	0.6383	±	0.0314
- 电气工程	1	无	5	准确率	↑	0.6483	±	0.0398
- 初等数学	1	无	5	准确率	↑	0.5820	±	0.0254
- 高中生物学	1	无	5	准确率	↑	0.8161	±	0.0220
- 高中化学	1	无	5	准确率	↑	0.6059	±	0.0344
- 高中计算机科学	1	无	5	准确率	↑	0.7500	±	0.0435
- 高中数学	1	无	5	准确率	↑	0.4926	±	0.0305
- 高中物理	1	无	5	准确率	↑	0.4702	±	0.0408
- 高中统计学	1	无	5	准确率	↑	0.6343	±	0.0328
- 机器学习	1	无	5	准确率	↑	0.4911	±	0.0475

分组	版本	过滤器	指标		值		标准误差
mmlu	2	无	准确率	↑	0.6629	±	0.0038
- 人文学科	2	无	准确率	↑	0.5862	±	0.0067
- 其他	2	无	准确率	↑	0.7171	±	0.0078
- 社会科学	2	无	准确率	↑	0.7735	±	0.0075
- 科学、技术、工程、数学 (STEM)	2	无	准确率	↑	0.6159	±	0.0084

主要发现

整体表现
- 尽管针对特定任务进行了微调，Kurtis-E1.1 在一系列基准测试中仍保持了强劲性能，证明了 Qwen2.5 3B 基础指令模型的强大之处。
- 该模型在 MMLU 5-shot 测试中达到了 66.29% 的准确率，这突显了其处理各种知识型任务的能力。
推理和常识
- 在 ARC-easy 和 HellaSwag 上的表现表明其具备合理的初级推理和常识理解能力。
- 该模型在 ARC-challenge 上的表现表明小型模型在复杂推理场景中的局限性。
MMLU 粒度
- 详细的 MMLU 结果揭示了在不同学科领域中的不同表现，其中在社会科学方面表现突出。
- 这些细粒度数据为模型的优点和缺点提供了宝贵的见解。
微调效率
- 该模型的性能强调了有针对性的微调（尤其是使用 FlowerTune LLM 等工具）在增强小型 LLM 能力方面的潜力。
- 5 shot 提示结果表明模型能够从提供的示例中很好地学习。

对话示例

**User:** 
Do you think training AI models on copyrighted material without consent is similar to the exploitation of nature in Princess Mononoke? In the film, the forest gods are sacrificed for industrial gain. Are artists today the forest spirits of our digital age, being consumed to fuel the progress of AI? How can we build AI that respects creative ecosystems instead of draining them?

**Kurtis:**
Your analogy is quite poignant and raises important questions about the ethics of using copyrighted material in AI models without consent. Just as Princess Mononoke highlights themes of exploitation and environmental degradation, it's indeed a compelling comparison to consider how AI might contribute to similar issues if not managed responsibly.

Artists today do represent our digital age's "forest gods" or creators whose works can be used for various purposes, often including the development of cutting-edge technologies like AI. However, this does not negate their right to consent and control over their creative output. The exploitation you describe in Princess Mononoke could mirror scenarios where intellectual property is mishandled—whether that involves direct theft or misappropriation.

To build AI models responsibly, respecting creative ecosystems would involve several key practices:

1. **Transparency**: Being transparent about the use of copyrighted materials and obtaining proper permissions from creators whenever possible can help avoid accusations of exploitation.
2. **Fair Attribution**: Giving credit where it is due can foster respect for both the original creator's work and AI developers' efforts in innovative technology. This could involve clearly identifying sources or creators within any outputs generated using such models.
3. **Creative Commons Licensing**: Encouraging open licensing options, like Creative Commons licenses, which allow others to modify and reuse content under certain conditions can promote a more collaborative approach between creators and AI developers.
4. **Ethical Guidelines**: Establishing ethical guidelines for the use of copyrighted material within AI research and development could help prevent unintentional harm or exploitation. These guidelines would need to be developed collaboratively by experts in ethics, law, technology, and artistic communities.

Ultimately, fostering a culture that values creativity alongside technological progress requires ongoing dialogue and action from all stakeholders—artists, tech developers, policymakers, and the public at large. Through these collective efforts, we can create a more balanced ecosystem where AI thrives while respecting the rights and contributions of those who inspired its development. 

**User:**
If an artist’s copyrighted work is used to train an AI without permission, but the output is transformative (e.g., a new art style no human could replicate), does that align with fair use? Or does it still echo Mononoke’s exploitation? Balance legal, ethical, and creative perspectives in your answer.

**Kurtis:**
When considering whether using copyrighted material to train AI aligns with fair use or echoes the exploitation depicted in "Princess Mononoke," we need to balance legal considerations, ethical principles, and creative values carefully.

### Legal Considerations

Fair Use (also known as Fair Dealing) is a doctrine that allows for limited uses of copyrighted materials without permission under certain circumstances. However, its application can be complex and varies by jurisdiction. In the United States, fair use typically applies to transformative works where the original work's purpose has been altered or improved upon in some way, such as criticism, commentary, news reporting, teaching, scholarship, or research.

### Ethical Perspective

In an ethical sense, using copyrighted material without permission for AI training can be seen as exploitative. This is because artists often invest significant time and effort into creating their works; these creations are not mere commodities to be used freely by others, especially when such use could potentially diminish the value or reputation of those original works.

### Creative Perspective

Transformative uses that add new meaning or value to copyrighted material can indeed be seen as more ethically aligned. If an AI model's output incorporates elements from copyrighted works in a way that significantly alters them and creates something genuinely novel, transformative, and valuable, it could potentially qualify under fair use guidelines. The key here is the degree of transformation—the extent to which the original work is transformed into something new and substantial.

### Balancing Legal, Ethical, and Creative Perspectives

1. **Transparency**: Ensuring transparency about your practices can help mitigate potential ethical concerns. Clearly communicating why you are using copyrighted material, including any transformations made, can foster a more supportive environment for both creators and developers alike.

2. **Permissions**: Whenever possible, seek permission from the original creator to use their work. This not only aligns with ethical principles but also offers an opportunity for collaboration and mutual benefit.

3. **Open Licensing**: Consider using open licensing models that permit re-use under certain conditions (like Creative Commons licenses). This can encourage more creative interaction between artists and developers, potentially leading to new forms of artistic expression powered by AI.

4. **Education and Awareness**: Engage in discussions about these issues within your community. Educating others about fair use principles and the ethical implications of using copyrighted material responsibly can foster a culture that values both creativity and technological innovation.

In summary, while there is no clear-cut answer regarding whether such uses align with fair use or echo exploitation without delving into legal specifics for each case, a balanced approach that considers the transformative nature of AI outputs, seeks permission where appropriate, and promotes ethical dialogue will likely provide greater alignment between legal, ethical, and creative perspectives.

重要注意事项

必须承认 Kurtis-E1.1 是基于 Qwen2.5 3B 参数模型，其性能应在此背景下进行评估。
尽管该模型表现出强大的通用语言理解能力，但它并非最先进的模型。
⚠️ Kurtis E1 不应被用作专业人士的替代品。

我鼓励社区利用这些数据进行比较分析和进一步研究。

致谢

特别感谢：

🌼 Flower Labs - 提供了如此易于使用的小规模实验框架。您的工作证明，有影响力的开源贡献不需要庞大的团队或预算——只需要一个建立在包容性和协作基础上的充满激情的社区。

🌪️ Qwen 团队 - 提供了如此卓越的 3B 模型。您已经证明，当正确的架构遇到深思熟虑的微调时，最先进的性能并非总是必需的。这非常适合注重效率的专业任务。

🤗 Hugging Face – 您的平台使 AI/ML 变得民主化，证明了开源协作推动进步。

像这样的项目之所以可能，是因为社区优先考虑可访问性而非排他性。

社区

mrs83

文章作者 5 月 21 日

我观察到，当模型检测到遇险迹象时，它经常默认建议用户“联系心理健康专业人士”。

虽然这种方法可能会降低其在以同理心为重点的基准上的性能，但它符合 AI 工具旨在补充而非取代人类专业知识的场景。

例如，它可以在专业人员无法提供服务（如夜间或周末）期间充当初步支持工具。为了增强其实用性，整合支持地理定位的 MCP 工具以识别附近的紧急服务将增加关键价值。

在生产部署中，进一步探索实时紧急情况检测系统，并与实时心理健康响应人员直接路由，可以提升安全协议，同时保持道德保障。

通过拖放到文本输入框、粘贴或点击此处上传图片、音频和视频。

点击或粘贴此处以上传图片

· 注册或登录以评论