Data analyst agent: get your data's insights in the blink of an eye ✨

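This is an advanced tutorial. You should be familiar with the notions covered in this other cookbook first!

In this notebook we will build a data analyst agent: a code agent armed with data analysis libraries that can load and transform dataframes, extract insights from your data, and even plot the results!

Let's say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of individual passengers. But before digging into it myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting some figures to find insights.

Let's set up this system.

Run the line below to install the required dependencies: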

!pip install seaborn smolagents transformers -q -U

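We first create the agent. We use a CodeAgent (read the documentation to learn more about agent types), so we do not even need to give it any tools: it can run its code directly.

We simply make sure to let it use data-science-related libraries by passing them in additional_authorized_imports: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"].

In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries that are installed there.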
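As a quick sanity check before creating the agent, the snippet below sketches one way to confirm that each authorized import can actually be found in the current environment. The helper name missing_modules is hypothetical (it is not part of smolagents); it relies only on the standard library's importlib.util.find_spec.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            # For dotted names such as "matplotlib.pyplot", find_spec imports the
            # parent package first, so a missing parent raises ModuleNotFoundError.
            spec = importlib.util.find_spec(name)
        except ImportError:
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Same list as the one passed to additional_authorized_imports.
authorized_imports = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
print("Missing libraries:", missing_modules(authorized_imports))
```

An empty list means every library is available; anything listed should be installed with pip before running the agent.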

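⚙ Our agent will be powered by meta-llama/Llama-3.1-70B-Instruct through the InferenceClientModel class, which uses HF's Inference API: the Inference API lets you run open-source models quickly, easily and (within its free tier) at no cost!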

from smolagents import InferenceClientModel, CodeAgent
from huggingface_hub import login
import os

login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))

model = InferenceClientModel("meta-llama/Llama-3.1-70B-Instruct")

agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_steps=10,
)

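Data analysis 📊🤔

When running the agent, we give it additional notes taken directly from the competition, passing them to the run method through the additional_args keyword argument.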

import os

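# exist_ok=True lets this cell be re-run without failing when the folder already exists
os.makedirs("./figures", exist_ok=True)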

>>> additional_notes = """
... ### Variable Notes
... pclass: A proxy for socio-economic status (SES)
... 1st = Upper
... 2nd = Middle
... 3rd = Lower
... age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
... sibsp: The dataset defines family relations in this way...
... Sibling = brother, sister, stepbrother, stepsister
... Spouse = husband, wife (mistresses and fiancés were ignored)
... parch: The dataset defines family relations in this way...
... Parent = mother, father
... Child = daughter, son, stepdaughter, stepson
... Some children travelled only with a nanny, therefore parch=0 for them.
... """

>>> analysis = agent.run(
...     """You are an expert data analyst.
... Please load the source file and analyze its content.
... According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
... Then answer these questions one by one, by finding the relevant numbers.
... Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.

... In your final answer: summarize these correlations and trends
... After each number derive real worlds insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggest people are more bored in winter".
... Your final answer should have at least 3 numbered and detailed parts.
... """,
...     additional_args=dict(additional_notes=additional_notes, source_file="titanic/train.csv"),
... )

>>> print(analysis)

The analysis of the Titanic data reveals that socio-economic status and sex are significant factors in determining survival rates. Passengers with lower socio-economic status and males are less likely to survive. The age of a passenger has a minimal impact on their survival rate.
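The prompt asks the agent to save its plots to './figures/', so it can be worth checking what was actually written there. The following is a minimal sketch; the helper name list_figures and the set of file extensions are assumptions, since the agent is free to choose file names and formats.

```python
from pathlib import Path


def list_figures(folder="./figures", extensions=(".png", ".jpg", ".jpeg", ".svg", ".pdf")):
    """Return the sorted names of image-like files found directly in folder."""
    folder = Path(folder)
    if not folder.is_dir():
        # Nothing to list if the folder was never created.
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


# Print whatever the agent saved during its run (prints nothing if the folder is empty).
for name in list_figures():
    print(name)
```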

Impressive, isn't it? You could also give your agent a visualization tool so that it can reflect on its own plots!

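Data scientist agent: run predictions 🛠️

👉 Now let's dig further: we will let our model make predictions on the data.

To do so, we also let it use sklearn by adding it to additional_authorized_imports (sklearn is provided by the scikit-learn package, which must be installed in your environment).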

agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=[
        "numpy",
        "pandas",
        "matplotlib.pyplot",
        "seaborn",
        "sklearn",
    ],
    max_steps=12,
)

output = agent.run(
    """You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    additional_args=dict(additional_notes=additional_notes + "\n" + analysis),
)

Despite a few errors along the way, the agent managed to solve the task in the end!
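Before submitting, it can help to check that the file the agent wrote has the shape Kaggle expects for the Titanic competition: a PassengerId,Survived header followed by 418 rows with Survived equal to 0 or 1. The sketch below assumes the agent wrote './output.csv' in exactly that format, which is not guaranteed; the helper name check_submission is hypothetical.

```python
import csv
import os


def check_submission(path, expected_rows=418):
    """Return a list of problems found in a Titanic submission CSV (empty if none)."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows = list(reader)

    if header != ["PassengerId", "Survived"]:
        return [f"unexpected header: {header}"]

    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # Values are read as strings, so valid predictions are exactly "0" or "1".
    bad_ids = [row["PassengerId"] for row in rows if row["Survived"] not in {"0", "1"}]
    if bad_ids:
        problems.append(f"{len(bad_ids)} rows have a Survived value other than 0 or 1")
    return problems


# Only runs the check when the agent actually produced the file.
if os.path.exists("./output.csv"):
    print(check_submission("./output.csv") or "Submission file looks well-formed.")
```

If the agent wrote floats or booleans in the Survived column, the check reports them, and the prompt can be tightened to request integer labels.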

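The test predictions produced by the agent, once submitted to Kaggle, score 0.78229, which ranks #2824 out of 17,360 participants (roughly the top 16%) and beats the result I painfully achieved when I first tried this challenge years ago.

Your results will vary, but either way I find it very impressive to achieve this with an agent in a few seconds.

🚀 The above is just a naive first attempt at a data analyst agent: it can certainly be improved a lot to better fit your use case!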
