Data analyst agent: get insights from your data in the blink of an eye ✨

Author: Aymeric Roucher

This tutorial is advanced. You should understand the concepts from another cookbook first!

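In this notebook we will build a data analyst agent: a code agent equipped with data analysis libraries that can load and transform dataframes to extract insights from your data, and even plot the results!

Let's say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of individual passengers. But before digging into it myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting some figures to find insights.

Let's set up this system.

Run the line below to install the required dependencies: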

!pip install seaborn smolagents transformers -q -U

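We first create the agent. We use a CodeAgent (read the documentation to learn more about agent types), so we do not even need to give it any tools: it can run its own code directly.

We only need to let it use data-science libraries by passing the following in additional_authorized_imports: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]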

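In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries that are installed there.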
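One way to confirm this before creating the agent is to ask Python whether each listed module can be found. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and is not part of smolagents.

import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names (e.g. "pkg.sub") when the parent package is absent
            found = False
        if not found:
            missing.append(name)
    return missing


authorized = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
# An empty list means every authorized import is available locally
print(missing_modules(authorized))

If the printed list is not empty, install the missing packages before running the agent.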

⚙ Our agent will be powered by meta-llama/Llama-3.1-70B-Instruct through the HfApiModel class, which uses HF's Inference API: the Inference API lets you run any open model quickly, easily and for free!

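from smolagents import HfApiModel, CodeAgent
from huggingface_hub import login
import os

# Authenticate with the Hugging Face Hub using a token read from the environment
login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))

# Model served through the HF Inference API
model = HfApiModel("meta-llama/Llama-3.1-70B-Instruct")

agent = CodeAgent(
    tools=[],  # no tools needed: a CodeAgent runs the code it writes
    model=model,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_steps=10,  # smolagents names the step limit max_steps, not max_iterations
)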

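Data analysis 📊🤔

When running the agent, we give it additional notes taken directly from the competition, and pass them to the run method through additional_args.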

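import os

# Create the output folder for figures; exist_ok avoids an error when the cell is re-run
os.makedirs("./figures", exist_ok=True)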
>>> additional_notes = """
... ### Variable Notes
... pclass: A proxy for socio-economic status (SES)
... 1st = Upper
... 2nd = Middle
... 3rd = Lower
... age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
... sibsp: The dataset defines family relations in this way...
... Sibling = brother, sister, stepbrother, stepsister
... Spouse = husband, wife (mistresses and fiancés were ignored)
... parch: The dataset defines family relations in this way...
... Parent = mother, father
... Child = daughter, son, stepdaughter, stepson
... Some children travelled only with a nanny, therefore parch=0 for them.
... """

>>> analysis = agent.run(
...     """You are an expert data analyst.
... Please load the source file and analyze its content.
... According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
... Then answer these questions one by one, by finding the relevant numbers.
... Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.
...
... In your final answer: summarize these correlations and trends
... After each number derive real worlds insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggest people are more bored in winter".
... Your final answer should have at least 3 numbered and detailed parts.
... """,
...     additional_args=dict(additional_notes=additional_notes, source_file="titanic/train.csv"),
... )
>>> print(analysis)
The analysis of the Titanic data reveals that socio-economic status and sex are significant factors in determining survival rates. Passengers with lower socio-economic status and males are less likely to survive. The age of a passenger has a minimal impact on their survival rate.
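
A summary like this is easy to spot-check by hand. The sketch below computes survival rates per group with pandas on a tiny made-up frame. The column names Survived, Pclass and Sex are assumed to match the header of the Kaggle train.csv, and the rows are illustrative only, not real passenger data.

import pandas as pd

# Toy rows for illustration only (NOT real Titanic data).
# Column names are assumed to match the Kaggle train.csv header.
toy = pd.DataFrame(
    {
        "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
        "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
        "Sex": ["female", "female", "male", "female", "male", "male", "male", "female"],
    }
)


def survival_rate_by(df, column):
    """Mean of the 0/1 Survived column for each distinct value of `column`."""
    return df.groupby(column)["Survived"].mean()


by_class = survival_rate_by(toy, "Pclass")
by_sex = survival_rate_by(toy, "Sex")
print(by_class)
print(by_sex)

To check the agent's claims against the real data, replace the toy frame with pd.read_csv("titanic/train.csv") and compare the group means with the numbers the agent reports.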

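Impressive, isn't it? You could also give your agent a visualization tool so that it can reflect on its own plots!

Data scientist agent: run predictions 🛠️

👉 Now let's dig further: we will let our model perform predictions on the data.

To do so, we also let it use sklearn in additional_authorized_imports.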

agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=[
        "numpy",
        "pandas",
        "matplotlib.pyplot",
        "seaborn",
        "sklearn",
    ],
    max_steps=12,  # smolagents names the step limit max_steps, not max_iterations
)

output = agent.run(
    """You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    additional_args=dict(additional_notes=additional_notes + "\n" + analysis),
)

Even though the agent made a few errors along the way, it eventually solved the problem!

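The test predictions that the agent output above, once submitted to Kaggle, score 0.78229, which ranks #2824 out of 17,360 participants and is better than what I had painfully achieved when I first tried this challenge years ago.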
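
Before submitting, it can help to sanity-check the file the agent wrote. The sketch below uses only the standard library and assumes the Kaggle Titanic submission format, that is, a header of PassengerId,Survived followed by 0/1 predictions; the helper name `check_submission` is hypothetical.

import csv

# Assumption: the Kaggle Titanic submission format, i.e. a header with
# "PassengerId" and "Survived", and predictions encoded as 0 or 1.
EXPECTED_COLUMNS = ["PassengerId", "Survived"]


def check_submission(path):
    """Return a list of problems found in the CSV at `path` (empty if none)."""
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            # Also catches an extra index column or an empty file
            problems.append(f"unexpected header: {reader.fieldnames}")
            return problems
        seen_ids = set()
        # Data rows start on line 2, right after the header
        for line_no, row in enumerate(reader, start=2):
            if row["Survived"] not in {"0", "1"}:
                problems.append(f"line {line_no}: Survived is {row['Survived']!r}")
            if row["PassengerId"] in seen_ids:
                problems.append(f"line {line_no}: duplicate PassengerId {row['PassengerId']}")
            seen_ids.add(row["PassengerId"])
    return problems

Calling check_submission("./output.csv") after the agent finishes returns an empty list when the file matches the assumed format, and a list of human-readable problems otherwise.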

Your results may vary, but either way I find it very impressive that an agent can do this in a few seconds.

🚀 The above is just a first attempt at a data analyst agent: it can certainly be improved a lot to better fit your use case!
