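Data analyst agent: get your data's insights in the blink of an eye ✨
This tutorial is advanced. You should be familiar with the concepts from this other cookbook first!
In this notebook we will build a **data analyst agent: a code agent equipped with data analysis libraries, which can load and transform dataframes to extract insights from your data, and even plot the results!**
Let's say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of individual passengers. But before digging into it myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting a few figures to find insights.
Let's set up this system.
Run the line below to install the required dependencies: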
!pip install seaborn "transformers[agents]"
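We first create the agent. We use a ReactCodeAgent (read the documentation to learn more about agent types), so we do not even need to give it any tools: it can run its own code directly.
We simply make sure it is allowed to use data science libraries by passing them in additional_authorized_imports: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"].
In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries that are installed there.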
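As a quick pre-flight check for the point above, the sketch below looks up each authorized import with the standard library's importlib.util.find_spec and reports the ones that are not installed. The helper name missing_top_level_packages is hypothetical and is not part of transformers; it only checks the top-level package (for example matplotlib for "matplotlib.pyplot").

```python
import importlib.util


def missing_top_level_packages(authorized_imports):
    """Return the top-level packages from `authorized_imports` that cannot be found.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    missing = []
    for name in authorized_imports:
        top_level = name.split(".")[0]  # "matplotlib.pyplot" -> "matplotlib"
        if importlib.util.find_spec(top_level) is None and top_level not in missing:
            missing.append(top_level)
    return missing


authorized = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
missing = missing_top_level_packages(authorized)
if missing:
    print("Install these before running the agent:", ", ".join(missing))
else:
    print("All authorized imports are available.")
```

Running this before creating the agent surfaces a missing package up front instead of as an import error in the middle of an agent run.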
⚙ Our agent will be powered by meta-llama/Meta-Llama-3.1-70B-Instruct through the HfEngine class, which uses HF's Inference API: the Inference API makes it quick and easy to run any open-source model.
from transformers.agents import HfEngine, ReactCodeAgent
from huggingface_hub import login
import os

login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))

llm_engine = HfEngine("meta-llama/Meta-Llama-3.1-70B-Instruct")

agent = ReactCodeAgent(
    tools=[],
    llm_engine=llm_engine,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_iterations=10,
)
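Data analysis 📊🤔
When running the agent, we give it additional notes taken directly from the competition and pass them as a keyword argument to the run method: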
import os
os.makedirs("./figures", exist_ok=True)  # does not fail if the cell is re-run
additional_notes = """
### Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
"""
analysis = agent.run(
    """You are an expert data analyst.
Please load the source file and analyze its content.
According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
Then answer these questions one by one, by finding the relevant numbers.
Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.
In your final answer: summarize these correlations and trends
After each number derive real worlds insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggest people are more bored in winter".
Your final answer should have at least 3 numbered and detailed parts.
""",
    additional_notes=additional_notes,
    source_file="titanic/train.csv",
)
>>> print(analysis)
Here are the correlations and trends found in the data:

1. **Correlation between age and survival rate**: The correlation is -0.0772, which suggests that as age increases, the survival rate decreases. This implies that older passengers were less likely to survive the Titanic disaster.
2. **Relationship between Pclass and survival rate**: The survival rates for each Pclass are:
   - Pclass 1: 62.96%
   - Pclass 2: 47.28%
   - Pclass 3: 24.24%

   This shows that passengers in higher socio-economic classes (Pclass 1 and 2) had a significantly higher survival rate compared to those in the lower class (Pclass 3).
3. **Relationship between fare and survival rate**: The correlation is 0.2573, which suggests a moderate positive relationship between fare and survival rate. This implies that passengers who paid higher fares were more likely to survive the disaster.
Impressive, isn't it? You could also give your agent a visualization tool so that it can reflect on its own plots!
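Figures like the per-class survival rates and the fare correlation reported above are easy to recompute by hand, which is a useful sanity check on any agent-written analysis. The sketch below shows the two pandas operations involved on a tiny made-up dataframe; the rows are invented for illustration and are not taken from the Titanic data. The column names Pclass, Fare and Survived are assumed to match those in titanic/train.csv.

```python
import pandas as pd

# Toy data for illustration only: these rows are made up, not taken from titanic/train.csv.
df = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
        "Fare": [80.0, 60.0, 30.0, 20.0, 8.0, 7.5, 7.0, 9.0],
        "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
    }
)

# Survival rate per class: the mean of a 0/1 column is the share of survivors.
survival_by_class = df.groupby("Pclass")["Survived"].mean()

# Pearson correlation between fare and survival.
fare_corr = df["Fare"].corr(df["Survived"])

print(survival_by_class)
print(f"Fare vs. survival correlation: {fare_corr:.4f}")
```

To check the agent's numbers on the real data, replace the toy dataframe with pd.read_csv("titanic/train.csv") and compare the printed values with the agent's summary.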
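Data scientist agent: run predictions 🛠️
👉 Now let's dig further: **we will let our agent train a model and make predictions on the data.**
To do so, we also let it use sklearn through additional_authorized_imports.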
agent = ReactCodeAgent(
    tools=[],
    llm_engine=llm_engine,
    additional_authorized_imports=[
        "numpy",
        "pandas",
        "matplotlib.pyplot",
        "seaborn",
        "sklearn",
    ],
    max_iterations=12,
)

output = agent.run(
    """You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    additional_notes=additional_notes + "\n" + analysis,
)
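The test predictions that the agent output above, once submitted to Kaggle, scored **0.78229**, which ranks #2824 out of 17,360 and beats what I had painfully achieved when I first tried the challenge years ago.
Your results will vary, but either way I find it very impressive to achieve this with an agent in a few seconds.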
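Since the agent decides on its own how to write ./output.csv, it can be worth checking the file before submitting it. The sketch below is one way to do that with the standard library only. The helper check_submission is hypothetical, and the expected columns (PassengerId, Survived, with 0/1 values) are an assumption based on the usual Kaggle Titanic submission format, so adjust them if your competition differs.

```python
import csv
import io


def check_submission(csv_file, expected_columns=("PassengerId", "Survived")):
    """Return a list of problems found in a submission CSV (empty if none).

    Hypothetical helper; the expected columns are an assumption about the
    Kaggle Titanic submission format.
    """
    reader = csv.DictReader(csv_file)
    problems = []
    if tuple(reader.fieldnames or ()) != tuple(expected_columns):
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        if row["Survived"] not in ("0", "1"):
            problems.append(f"line {line_number}: Survived={row['Survived']!r}")
    return problems


# Self-contained demo on an in-memory file with made-up rows.
demo = io.StringIO("PassengerId,Survived\n892,0\n893,1\n894,yes\n")
print(check_submission(demo))
```

On the real file, the same function can be called as check_submission(open("./output.csv", newline="")). An empty list means the header and the Survived values look as expected.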
🚀 The above is just a naive first attempt at a data analyst agent: it can certainly be improved a lot to better fit your use case!