GitVac
Don't forget to vacuum your git repo.

What is GitVac?
GitVac is like a vacuum cleaner for code fixes. It comes in a family of 3B, 8B, 14B, and 32B models. The goal: get AI to extract correct roleplay scenarios that can then be used to further fine-tune strong local coding agents.
How were the models made?
I distilled samples from r1 through multiple rounds of trial and error. In total, about 2.4k problems were issued, and 1.1k of them passed validation. My rough estimate is that r1 lands at about 15% accuracy after multiple attempts.
Training data
The data used to train the models comes from r1 outputs obtained through distillation. To assess model accuracy, however, o3/r1 were evaluated with the same prompts.
How does validation work?
Many models have already been trained on function-calling syntax. In my early prompting attempts, I found that presenting the patch and asking the model to compose the functions itself was not the most effective approach. After testing various methods, I eventually settled on a pre-built function-call structure that the model only needs to copy and adapt. As a bonus, this makes it easy to verify that the functions appear in the formatted answer.
For each patch, the AI needs to...
Fire off a read call for every file it needs to check out
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "pandas/tests/io/formats/test_to_markdown.py"}}</tool_call>
Then make some string replacements (sometimes a single tweak, sometimes a whole pile of them)
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "modin/pandas/test/test_series.py", "old_string": " df_equals(modin_result, pandas_result)\n\n\n# This tests the bug from https://github.com/modin-project/modin/issues/3736\ndef test_loc_setting_categorical_series():\n modin_series = pd.Series([\"a\", \"b\", \"c\"], dtype=\"category\")\n", "new_string": " df_equals(modin_result, pandas_result)\n\n\ndef test_loc_with_boolean_series():\n modin_series, pandas_series = create_test_series([1, 2, 3])\n modin_mask, pandas_mask = create_test_series([True, False, False])\n modin_result = modin_series.loc[modin_mask]\n pandas_result = pandas_series.loc[pandas_mask]\n df_equals(modin_result, pandas_result)\n\n\n# This tests the bug from https://github.com/modin-project/modin/issues/3736\ndef test_loc_setting_categorical_series():\n modin_series = pd.Series([\"a\", \"b\", \"c\"], dtype=\"category\")\n"}}</tool_call>
In short: to solve the problem, you must open the files in the patch and apply every diff. To score very high accuracy, the tool calls must reproduce the strings one-to-one, because I match against exact strings. If every read call is executed exactly once and every tool call is invoked, it's a good dataset! (A sketch of such a check follows the final-call example below.)
Also note that this dataset allows terminal access, but only for deleting content from the GitHub patch, nothing else
<tool_call=terminal_access>{"name": "terminal_access", "parameters": {"command": "...", "working_dir": "..."}}</tool_call>
As a final check, we make sure the model calls
<tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
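Because the expected calls are known up front, the check reduces to exact string and set comparisons. Here is a minimal sketch of such a validator, assuming the tag format shown above; the helper and its signature are illustrative, not the actual GitVac code.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call=(\w+)>(.*?)</tool_call>", re.DOTALL)

def is_good_sample(transcript, expected_reads, expected_edits):
    """expected_reads: file paths; expected_edits: (file_path, old_string) pairs."""
    calls = [(name, json.loads(body)) for name, body in TOOL_CALL_RE.findall(transcript)]
    if not calls or calls[-1][0] != "attempt_completion":
        return False  # the run must finish with the completion call

    # Every file in the patch must be read, and each read executed exactly once.
    reads = [c["parameters"]["path"] for n, c in calls if n == "read_file"]
    if sorted(reads) != sorted(expected_reads):
        return False

    # Every diff hunk must be applied with one-to-one string accuracy.
    edits = {(c["parameters"]["file_path"], c["parameters"]["old_string"])
             for n, c in calls if n == "str_replace_editor"}
    return set(expected_edits) <= edits
```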
Final output
The final output is a list of XML tags representing reasons and actions.
<reason>
This is where all the analysis and reasoning lives
</reason>
<action>
This is where the tool call magic happens
</action>
The model outputs a reason for every action call in the patch.
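For illustration, splitting such a transcript back into (reason, action) pairs takes only a small parser. This sketch is an assumption, not part of the released tooling:

```python
import re

PAIR_RE = re.compile(
    r"<reason>\s*(.*?)\s*</reason>\s*<action>\s*(.*?)\s*</action>",
    re.DOTALL,
)

def parse_pairs(output):
    """Return the (reason, action) pairs in the order the model emitted them."""
    return PAIR_RE.findall(output)

# The final pair should always carry the attempt_completion call, e.g.:
# assert "attempt_completion" in parse_pairs(model_output)[-1][1]
```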
GitVac-8B-Turbo
<reason>
I will begin by reading the test file to understand the expected behavior and then proceed to inspect the main implementation file. After confirming the current implementation, I will apply the required modifications to update the import statements and add the missing functions. Finally, I will update the test file to include new tests for the non-zero and where functions.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dask/array/tests/test_masked.py"}}</tool_call>
</action>
<reason>
Now I will inspect the main implementation file to determine the current state of the import section.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dask/array/ma.py"}}</tool_call>
</action>
<reason>
Before making modifications, I will list the directory structure to ensure that all necessary files are present.
</reason>
<action>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "dask/array", "recursive": "true"}}</tool_call>
</action>
<reason>
I will now perform a ripgrep search for "np.ma.core" to locate any references to the numpy masked array functions in the repository.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "np.ma.core", "path": "dask/array", "file_pattern": "*.py"}}</tool_call>
</action>
<reason>
Based on the input, I will now update the import section in the main implementation file to include the missing "elemwise" and "nonzero" imports.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
Next, I will update the import section in the same file to include "nonzero" from dask.array.routines.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
I will now update the "empty_like" function to include the new non-zero function and then add the "nonzero" and "where" functions with the appropriate docstring and behavior.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "def empty_like(a, **kwargs):\n a = asanyarray(a)\n return a.map_blocks(np.ma.core.empty_like, **kwargs)\n", "new_string": "def empty_like(a, **kwargs):\n a = asanyarray(a)\n return a.map_blocks(np.ma.core.empty_like, **kwargs)\n\n\n@derived_from(np.ma.core)\ndef nonzero(a):\n return _nonzero(getdata(a) * ~getmaskarray(a))\n\n\n@derived_from(np.ma.core)\ndef where(condition, x=None, y=None):\n if (x is None) != (y is None):\n raise ValueError(\"either both or neither of x and y should be given\")\n if (x is None) and (y is None):\n return nonzero(condition)\n else:\n return elemwise(np.ma.where, condition, x, y)\n"}}</tool_call>
</action>
<reason>
I will now update the import section in the test file to ensure that it imports the new non-zero function.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
Next, I will update the import section in the test file to include the missing "elemwise" import.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/ma.py", "old_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n", "new_string": "import numpy as np\n\nfrom dask.array import chunk\nfrom dask.array.core import asanyarray, blockwise, elemwise, map_blocks\nfrom dask.array.reductions import reduction\nfrom dask.array.routines import _average\nfrom dask.array.routines import nonzero as _nonzero\nfrom dask.base import normalize_token\nfrom dask.utils import derived_from\n\n"}}</tool_call>
</action>
<reason>
I will now update the test file to add tests for the new non-zero and where functions.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dask/array/tests/test_masked.py", "old_string": " assert_eq(da.ma.getmaskarray(res), np.ma.getmaskarray(sol))\n else:\n assert_eq(res, sol)\n", "new_string": " assert_eq(da.ma.getmaskarray(res), np.ma.getmaskarray(sol))\n else:\n assert_eq(res, sol)\n\n\ndef test_nonzero():\n data = np.arange(9).reshape((3, 3))\n mask = np.array([[True, False, False], [True, True, False], [True, False, True]])\n a = np.ma.array(data, mask=mask)\n d_a = da.ma.masked_array(data=data, mask=mask, chunks=2)\n\n for c1, c2 in [\n (a > 4, d_a > 4),\n (a, d_a),\n (a <= -2, d_a <= -2),\n (a == 0, d_a == 0),\n ]:\n sol = np.ma.nonzero(c1)\n res = da.ma.nonzero(c2)\n\n assert isinstance(res, type(sol))\n assert len(res) == len(sol)\n\n for i in range(len(sol)):\n assert_eq(res[i], sol[i])\n\n\ndef test_where():\n # Copied and adapted from the da.where test.\n x = np.random.randint(10, size=(15, 14))\n mask = np.random.choice(a=[False, True], size=(15, 14), p=[0.5, 0.5])\n x[5, 5] = x[4, 4] = 0 # Ensure some false elements\n d = da.ma.masked_array(x, mask=mask, chunks=(4, 5))\n x = np.ma.array(x, mask=mask)\n y = np.random.randint(10, size=15).astype(np.uint8)\n e = da.from_array(y, chunks=(4,))\n\n # Nonzero test\n sol = np.ma.where(x)\n res = da.ma.where(d)\n for i in range(len(sol)):\n assert_eq(res[i], sol[i])\n\n for c1, c2 in [\n (d > 5, x > 5),\n (d, x),\n (1, 1),\n (5, 5),\n (True, True),\n (np.True_, np.True_),\n (0, 0),\n (False, False),\n (np.False_, np.False_),\n ]:\n for b1, b2 in [(0, 0), (-e[:, None], -y[:, None]), (e[:14], y[:14])]:\n w1 = da.ma.where(c1, d, b1)\n w2 = np.ma.where(c2, x, b2)\n assert_eq(w1, w2)\n"}}</tool_call>
</action>
<reason>
I have now applied all the necessary modifications to both the implementation and test files. This update ensures that the non-zero and where functions preserve the mask of masked arrays, as required.
</reason>
<action>
<tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
</action>
GitVac-R-14B
<think>
Alright, let's work through this problem. The user's issue is that Boto3's describe_db_clusters method returns a "cluster not found" error when ClusterArn is used as a filter, even though the cluster exists. The analysis shows that passing the DBClusterIdentifier as an ARN fails. The expected behavior is that filtering by ARN should work, since Boto3 allows passing either a ClusterArn or a DBClusterIdentifier.
First, I need to understand the problem thoroughly. The user's test case creates a cluster with a specific DBClusterIdentifier and then tries to describe it using both the identifier and the ARN. The test fails when the ARN is used. The hints mention this is a bug in Moto (the AWS service mocking library). The dataset example shows code changes in moto/rds/models.py and tests/test_rds/test_rds_clusters.py. The changes modify the describe_db_clusters method to split cluster_identifier on ":" and take the last part, effectively extracting the cluster ID from the ARN. The test case is also updated to capture the ARN when creating the cluster and then use it to describe the cluster.
Now, the dataset example shows a sequence of tool calls: reading files, searching for patterns, and replacing strings. These steps include reading models.py and the test file, searching for specific patterns like "describe_db_clusters" and "DBClusterArn", and replacing code blocks to handle ARNs correctly. The output structure is a sequence of reasoning and actions, ending with an attempt_completion call.
For the dataset structure, each example has a reason and an action. The actions use read_file, ripgrep_search, list_directory, and str_replace_editor. Input generation needs to cover scenarios where DBClusterIdentifier is passed either as an ARN or as a plain cluster ID, ensuring the filtering works correctly. Output derivation must check that the describe_db_clusters call returns the right cluster whether an ARN or an ID is used.
In the provided dataset example, the work breaks down into steps: reading files, searching for patterns, and modifying code. Each step is a separate action with a reason. Output derivation involves checking the length of DBClusters and making sure the correct cluster is returned when filtering by ARN or ID.
The key components here are the file modifications in the Moto model and the test case. The challenge is making sure the filtering logic handles ARNs correctly by extracting the cluster ID. Potential pitfalls include incorrect string splitting (for example, if the ARN contains multiple colons) or failing to update the test case to capture the ARN and use it properly.
The best generation scenario would involve creating a cluster with a specific ID, getting its ARN, then querying describe_db_clusters with both the ID and the ARN to verify that both return the cluster. Edge cases might include cluster IDs with special characters, very long ARNs, or clusters created in different regions.
For output derivation, the main check is whether the cluster is found both with the ARN and with the plain ID. The dataset example shows that after the code change, the describe_db_clusters method splits the ARN correctly to get the cluster ID, making the lookup succeed.
So, to generate the dataset, I need to create test cases covering different ARN formats, make sure the filtering logic in the code handles them correctly, and verify that the test expectations are met. Input generation should include valid ARNs of varying lengths and regions, and output derivation must confirm the cluster is found when queried by ARN or plain ID.
</think>
<reason>
I will first inspect the current implementation by reading the moto/rds/models.py file to understand how describe_db_clusters is handling the cluster_identifier.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "moto/rds/models.py"}}</tool_call>
</action>
<reason>
Next, I will inspect the test file tests/test_rds/test_rds_clusters.py to see how the tests are structured and what expectations are being set for describe_db_clusters.
</reason>
<action>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "tests/test_rds/test_rds_clusters.py"}}</tool_call>
</action>
<reason>
To further verify the current implementation and test expectations, I will search for any references to "describe_db_clusters" in the moto/rds/models.py file.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "describe_db_clusters", "path": "moto/rds/models.py", "file_pattern": "*.py"}}</tool_call>
</action>
<reason>
I will also search for "DBClusterArn" in the test file to see how the tests are handling the ARN values.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "DBClusterArn", "path": "tests/test_rds/test_rds_clusters.py", "file_pattern": "*.py"}}</tool_call>
</action>
<reason>
Now that I have a good understanding of the current implementation and tests, I will list the directory structure of the moto project to ensure I am working with the correct files.
</reason>
<action>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": ".", "recursive": "true"}}</tool_call>
</action>
<reason>
Based on the investigation, I will now update the describe_db_clusters implementation in moto/rds/models.py to properly handle ARN values by extracting the cluster identifier from the ARN.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "moto/rds/models.py", "old_string": "\n def describe_db_clusters(self, cluster_identifier):\n if cluster_identifier:\n if cluster_identifier in self.clusters:\n return [self.clusters[cluster_identifier]]\n if cluster_identifier in self.neptune.clusters:\n", "new_string": "\n def describe_db_clusters(self, cluster_identifier):\n if cluster_identifier:\n # ARN to identifier\n # arn:aws:rds:eu-north-1:123456789012:cluster:cluster --> cluster-id\n cluster_identifier = cluster_identifier.split(\":\")[-1]\n if cluster_identifier in self.clusters:\n return [self.clusters[cluster_identifier]]\n if cluster_identifier in self.neptune.clusters:\n"}}</tool_call>
</action>
<reason>
I will now update the test file tests/test_rds/test_rds_clusters.py to capture the ARN value returned when creating a cluster and use it to verify the describe_db_clusters functionality.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "tests/test_rds/test_rds_clusters.py", "old_string": " MasterUserPassword=\"hunter2_\",\n )\n\n client.create_db_cluster(\n DBClusterIdentifier=\"cluster-id2\",\n Engine=\"aurora\",\n MasterUsername=\"root\",\n MasterUserPassword=\"hunter2_\",\n )\n\n client.describe_db_clusters()[\"DBClusters\"].should.have.length_of(2)\n\n", "new_string": " MasterUserPassword=\"hunter2_\",\n )\n\n cluster_arn = client.create_db_cluster(\n DBClusterIdentifier=\"cluster-id2\",\n Engine=\"aurora\",\n MasterUsername=\"root\",\n MasterUserPassword=\"hunter2_\",\n )[\"DBCluster\"][\"DBClusterArn\"]\n\n client.describe_db_clusters()[\"DBClusters\"].should.have.length_of(2)\n\n"}}</tool_call>
</action>
<reason>
I will update the test file tests/test_rds/test_rds_clusters.py to verify that describe_db_clusters works when filtering by ARN.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "tests/test_rds/test_rds_clusters.py", "old_string": " \"DBClusters\"\n ].should.have.length_of(1)\n\n\n@mock_rds\ndef test_delete_db_cluster():\n", "new_string": " \"DBClusters\"\n ].should.have.length_of(1)\n\n client.describe_db_clusters(DBClusterIdentifier=cluster_arn)[\n \"DBClusters\"\n ].should.have.length_of(1)\n\n\n@mock_rds\ndef test_delete_db_cluster():\n"}}</tool_call>
</action>
<reason>
I have now applied the required modifications to both the implementation and test files. I will now complete the process.
</reason>
<action>
<tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
</action>
The Cursor format and your format
My first good experience with agents came from using Cursor. Hence the name: the Cursor format, where you tell the agent exactly which files to read instead of letting it hunt around. I like this format because it matches how I actually code: about 95% of the time, I already know which files need to change. The agent doesn't have to play detective... this approach is faster and more focused. When you build your own agent, think about how you actually work and shape your training data around that flow. The format you choose has a huge impact on how the agent learns to help you.
Roleplay functions
Additionally, as the final layer of the dataset creation stage, the entire format is rewritten and plausible functions are inserted
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "your_pattern", "path": "your_path", "file_pattern": "your_file_pattern"}}</tool_call>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "your_path", "recursive": "true_or_false"}}</tool_call>
The model already knows the solution but has to follow a realistic problem-solving process instead of jumping straight to the answer. This creates a more natural and educational dataset. Faced with a task like "add unit tests for feature X", the model doesn't immediately open the right file. Instead, it first explores the codebase like a human developer would: perhaps listing directories to find the test folder structure, and only then opening files in the tests/ directory. Likewise, when working on an issue that references specific code, the model simulates the search process. It "discovers" the relevant code through search queries, finds the matches, and only then reads the files; mimicking how a real developer would track down the code mentioned in an issue. This approach creates training data that teaches future models not just how to fix problems, but how to work through a codebase systematically.
Roleplay example
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "tests/unit/remote/ssh/test_ssh.py"}}</tool_call>
...
<reason>
Before proceeding further, I will list the directory contents of the dvc/remote/ssh folder to verify its structure.
</reason>
<action>
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "dvc/remote/ssh", "recursive": "true"}}</tool_call>
</action>
<reason>
I also want to search through the repository for "ssh_config" to see references that might need updates.
</reason>
<action>
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "ssh_config", "path": ".", "file_pattern": "*"}}</tool_call>
</action>
<reason>
Let's insert the import for the os module in the SSH initialization file. This change is required for accessing the SSH config file.
</reason>
<action>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dvc/remote/ssh/__init__.py", "old_string": "from __future__ import unicode_literals\n\nimport getpass\nimport logging\n\n", "new_string": "from __future__ import unicode_literals\n\nimport os\nimport getpass\nimport logging\n\n"}}</tool_call>
</action>
This subtle variation in the data introduces the new skills of searching/listing.
About the models
This work draws inspiration from many research papers, but especially from the SWE-Gym project and its comprehensive dataset.

All models
The GitVac family comes in specialized models:

Model Name | Parameters | Link
---|---|---
Llama-GitVac-Turbo-3B | 3B | 🤗 Hugging Face
Llama-GitVac-Turbo-8B | 8B | 🤗 Hugging Face
GitVac-R-14B | 14B | 🤗 Hugging Face
GitVac-R-32B | 32B | 🤗 Hugging Face
Turbo Models
The Turbo models are designed to be fast and cost-effective. The datasets these models generate do not include thinking tags.

Model | Base | Context Size | Best For
---|---|---|---
Llama-GitVac-Turbo-3B | Llama 3.1 | 128k | CPU use, edge devices
Llama-GitVac-Turbo-8B | Llama 3.1 | 128k | Better quality, still fast
When to use Turbo models
Use these models to quickly extract well-ordered GitHub patches with minimal reasoning. They can also serve as building blocks for other kinds of custom pipelines.
Reasoning Models
The reasoning models are tuned to do some thinking and planning around dataset creation.
Model | Base | Context Size | Max Output | Special Features
---|---|---|---|---
GitVac-R-14B | Qwen 2.5 | 128k | 32,000 tokens | Detailed reasoning
GitVac-R-32B | Qwen 2.5 | 128k | 32,000 tokens | Most thorough analysis
When to use reasoning models
These models produce higher-quality pre-built actions than the Turbo models. The original dataset was created without thinking tags. To distill a thinking process, I showed Deepseek R1 each of the 1.1k samples and asked it to create an in-depth plan for how the solution should be written. This helps capture the reasoning behind each fix.
Benchmarks
I started with 2,400 patches/problems.
- Only 1,100 of the problems could be solved by r1
- Each problem got up to 3 attempts. Earlier scripts allowed up to 10.
- The remaining 1,300 problems are ones these top models failed to solve even after roughly 9,000 attempts in total.
To evaluate the GitVac models, I drew a random sample from the unsolved portion of the dataset.
Performance Results
Tested against the remaining 1,300 problems that r1 could not pass. None of these ever appeared in the models' training.

Model | Success Rate | Notes
---|---|---
O3 | 30% | Used for evaluation
R1 | 15% | Used for evaluation/training
GitVac-3B-Turbo | 42% | Fast but inaccurate. Tags missing from the output.
Llama-GitVac-8B-Turbo | 90% | The sweet spot for speed/quality/cost. No thinking tags.
GitVac-R-14B | 92% | Excellent reasoning for the price/speed
GitVac-R-32B | 98% | Accurate; held up across up to 500 test samples
How to use
First, collect your patches and extract all the necessary components: string replacements, file reads, terminal deletions, the problem statement, and any hints.
Combine the tool calls into a single list and shuffle them randomly. This randomization turned out to be a key factor in improving dataset quality.
Initially, I presented the function calls in a fixed order (write, read, delete). The models followed that pattern blindly, making changes before reading the files, which makes no logical sense. Simply instructing them not to do this in the prompt had no effect.
The breakthrough came when I randomized the function calls. This seemed to break the models' rigid patterning and activate more natural problem-solving behavior. They began reading files properly before modifying them and showed more convincing roleplay ability.
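A minimal sketch of that step, assuming the calls are already rendered as the <tool_call=...> strings shown earlier; the helper name is illustrative, and its return value is what the prompt script below interpolates as final_calls:

```python
import random

def build_functions_block(read_calls, replace_calls, terminal_calls, seed=None):
    """Shuffle the pre-built tool-call strings so no fixed order leaks through."""
    calls = [*read_calls, *replace_calls, *terminal_calls]
    random.Random(seed).shuffle(calls)  # the key trick: break the write/read/delete pattern
    return "\n".join(calls)

# e.g. final_calls = build_functions_block(reads, replaces, deletes)
```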
No system prompt is used; everything goes into a single user message.
The prompt script:
prompt = f"""
Create a synthetic dataset based on the provided technical problem statement, following these instructions carefully:
The dataset must be accurate and realistic based on the problem statement. You may use other functions such as list_directory, ripgrep_search to add more context dataset.
Your goal is to use the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Problem Analysis:
<problem_statement>
"""+instance['problem_statement']+"""
</problem_statement>
Guidance Integration (if provided):
<hints>
"""+instance['hints_text']+"""
</hints>
Function Requirements:
Given to you in a random list for you to figure out the logical order.
<functions>
"""+final_calls+"""
</functions>
Available Code Interaction Tools (Use EXACTLY these formats):
<functions_to_roleplay_with>
To search for specific patterns in the codebase, use:
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "your_pattern", "path": "your_path", "file_pattern": "your_file_pattern"}}</tool_call>
To list directories and understand the file structure, use:
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "your_path", "recursive": "true_or_false"}}</tool_call>
</functions_to_roleplay_with>
Output Requirements:
First provide detailed technical reasoning
Then output ONE PRECISE tool call
At the end when the problem is done the last call should list the reason why and call <tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
Follow EXACT XML structure shown below:
<reason>
Analysis and reasoning for the action
</reason>
<action>
Tool call to be used
</action>
Important Notes:
Replace ALL placeholders with values relevant to problem statement
Parameters MUST reflect actual technical needs from the problem
Only output ONE action per response
Maintain strict JSON syntax within tool calls
You must use all functions provided in <functions_to_roleplay_with> as is within the answer.
Please create a roleplay dataset by using the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Think before you respond.
"""
print(f"Total estimated tokens: {estimated_tokens}")
response = client.chat.completions.create(
model=model_name, # Adjust model name if needed
temperature=0.6,
top_p=0.95,
messages=[
{"role": "user", "content": prompt}
],
max_completion_tokens=32000,
stream=True
)
# Collect the streamed response
collected_messages = []
for chunk in response:
if chunk.choices[0].delta.content is not None:
chunk_message = chunk.choices[0].delta.content
# print(chunk_message, end="", flush=True)
collected_messages.append(chunk_message)
response_content = ''.join(collected_messages)
return response_content
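For reference, a hedged usage sketch; the endpoint, model id, and field values below are placeholders, and generate_roleplay_sample is the wrapper name assumed above:

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint serving r1 will do; this URL is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

sample = generate_roleplay_sample(
    client,
    model_name="deepseek-r1",  # placeholder model id
    instance={"problem_statement": "alias `list` as `ls`?", "hints_text": ""},
    final_calls="<tool_call=read_file>...</tool_call>",  # shuffled calls from the earlier step
)
print(sample)
```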
Example input prompt
Create a synthetic dataset based on the provided technical problem statement, following these instructions carefully:
The dataset must be accurate and realistic based on the problem statement. You may use other functions such as list_directory, ripgrep_search to add more context dataset.
Your goal is to use the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Problem Analysis:
<problem_statement>
alias `list` as `ls`?
I have been reaching for `dvc ls` out of habit instead of `dvc list`. Should we introduce an alias for `dvc list`?
</problem_statement>
Guidance Integration (if provided):
<hints>
</hints>
Function Requirements:
Given to you in a random list for you to figure out the logical order.
<functions>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dvc/command/ls/__init__.py"}}</tool_call>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "tests/unit/command/ls/test_ls.py"}}</tool_call>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dvc/command/ls/__init__.py", "old_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n", "new_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n aliases=[\"ls\"],\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n"}}</tool_call>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "dvc/command/ls/__init__.py", "old_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n", "new_string": " )\n list_parser = subparsers.add_parser(\n \"list\",\n aliases=[\"ls\"],\n parents=[parent_parser],\n description=append_doc_link(LIST_HELP, \"list\"),\n help=LIST_HELP,\n"}}</tool_call>
<tool_call=str_replace_editor>{"name": "str_replace_editor", "parameters": {"file_path": "tests/unit/command/ls/test_ls.py", "old_string": " \"\\x1b[01;34msrc\\x1b[0m\",\n \"\\x1b[01;32mrun.sh\\x1b[0m\",\n ]\n", "new_string": " \"\\x1b[01;34msrc\\x1b[0m\",\n \"\\x1b[01;32mrun.sh\\x1b[0m\",\n ]\n\n\ndef test_list_alias():\n cli_args = parse_args([\"ls\", \"local_dir\"])\n assert cli_args.func == CmdList\n"}}</tool_call>
<tool_call=read_file>{"name": "read_file", "parameters": {"path": "dvc/command/ls/__init__.py"}}</tool_call>
</functions>
Available Code Interaction Tools (Use EXACTLY these formats):
<functions_to_roleplay_with>
To search for specific patterns in the codebase, use:
<tool_call=ripgrep_search>{"name": "ripgrep_search", "parameters": {"pattern": "your_pattern", "path": "your_path", "file_pattern": "your_file_pattern"}}</tool_call>
To list directories and understand the file structure, use:
<tool_call=list_directory>{"name": "list_directory", "parameters": {"path": "your_path", "recursive": "true_or_false"}}</tool_call>
</functions_to_roleplay_with>
Output Requirements:
First provide detailed technical reasoning
Then output ONE PRECISE tool call
At the end when the problem is done the last call should list the reason why and call <tool_call=attempt_completion>{"name": "attempt_completion", "parameters": {}}</tool_call>
Follow EXACT XML structure shown below:
<reason>
Analysis and reasoning for the action
</reason>
<action>
Tool call to be used
</action>
Important Notes:
Replace ALL placeholders with values relevant to problem statement
Parameters MUST reflect actual technical needs from the problem
Only output ONE action per response
Maintain strict JSON syntax within tool calls
You must use all functions provided in <functions_to_roleplay_with> as is within the answer.
Please create a roleplay dataset by using the provided tools to search and list directories, and then output a tool call that reflects your analysis and reasoning.
Think before you respond.
Cost & Details
The total cost of the project was roughly $400, most of it spent on inference. I used automated scripts to handle the full training pipeline, from fine-tuning through evaluation across all model sizes (up to 32B parameters). The hardware setup was an A100 80GB GPU and an H200 140GB+ GPU rented from RunPod. Training times ranged from 1.5 hours for the small models to 8 hours for the largest one. All reasoning models were trained for 3 epochs.
Post-processing
In the post-processing stage, we feed all of this data to another model that actually executes every tool-call step. That model fetches the real files, runs the commands, and collects all the outputs. This is also where you can add more information to the functions, such as line numbers in the string replacements. (A sketch of such a replay harness follows the list below.)
This accomplishes several things:
- Confirms our tool calls actually work (already verified in the first step)
- Adds real file contents and command outputs to our training data
- Updates the reasoning traces to further improve the model's reasoning ability
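A minimal sketch of this replay step, assuming the tag format used throughout; the function name, the record layout, and the per-tool handling are illustrative, not the actual pipeline code:

```python
import json
import re
import subprocess
from pathlib import Path

TOOL_CALL_RE = re.compile(r"<tool_call=(\w+)>(.*?)</tool_call>", re.DOTALL)

def replay_transcript(transcript, repo):
    """Execute each tool call against a real checkout and record its output."""
    repo = Path(repo)
    results = []
    for name, body in TOOL_CALL_RE.findall(transcript):
        params = json.loads(body)["parameters"]
        if name == "read_file":
            output = (repo / params["path"]).read_text()
        elif name == "str_replace_editor":
            target = repo / params["file_path"]
            text = target.read_text()
            # one-to-one string replacement, mirroring the validation rule
            target.write_text(text.replace(params["old_string"], params["new_string"], 1))
            output = "edit applied"
        elif name == "terminal_access":
            proc = subprocess.run(params["command"], shell=True,
                                  cwd=repo / params["working_dir"],
                                  capture_output=True, text=True)
            output = proc.stdout
        else:
            # list_directory, ripgrep_search, attempt_completion, ...
            output = ""
        results.append({"tool": name, "parameters": params, "output": output})
    return results
```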
With this dataset, we can fine-tune to get a base model. That model can then be improved further with RLHF (Reinforcement Learning from Human Feedback) and GRPO (Group Relative Policy Optimization) training, continuously learning from new datasets produced by the pipeline. This creates a virtuous cycle of improvement, where each iteration builds on the knowledge gained in previous runs. I should write a separate article on this extended pipeline at some point. For now, enjoy the repo!