Get the number of rows and the size in bytes
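This guide shows how to use the dataset viewer's /size endpoint to retrieve a dataset's size programmatically. You can also try it out with ReDoc.
The /size endpoint accepts the dataset name as its query parameter: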
Python
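import os
import requests

# Assumption: a Hugging Face access token is exported as HF_TOKEN.
API_TOKEN = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/size?dataset=ibm/duorc"

def query():
    # Send the GET request and decode the JSON body of the response.
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()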
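As a variation on the request above, the following is a minimal sketch that builds the query string from the dataset name instead of hard-coding it, and that surfaces HTTP errors. The helper names `build_size_url` and `fetch_size`, the optional `token` argument, and the 30-second timeout are illustrative assumptions, not part of the dataset viewer API. It uses only the Python standard library rather than `requests`.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://datasets-server.huggingface.co/size"


def build_size_url(dataset: str) -> str:
    """Build the /size URL; safe="/" keeps the namespace slash unescaped."""
    return f"{BASE_URL}?{urlencode({'dataset': dataset}, safe='/')}"


def fetch_size(dataset: str, token: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Call /size and decode the JSON body; urlopen raises HTTPError on 4xx/5xx."""
    # Hypothetical convenience: send the Authorization header only if a token is given.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    request = Request(build_size_url(dataset), headers=headers)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With network access, `fetch_size("ibm/duorc")` should return the same kind of dictionary as `query()`.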
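The endpoint response is a JSON object containing the size of the dataset, as well as the size of each of its subsets and splits. It provides the number of rows, the number of columns (where applicable), and the size in bytes for the different forms of the data: the original files, the size in memory (RAM), and the auto-converted Parquet files. For example, the ibm/duorc dataset has 187,213 rows across all its subsets and splits; its original files and its auto-converted Parquet files each total about 58.7 MB, and the data takes about 1.06 GB in memory.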
{
  "size": {
    "dataset": {
      "dataset": "ibm/duorc",
      "num_bytes_original_files": 58710973,
      "num_bytes_parquet_files": 58710973,
      "num_bytes_memory": 1060742354,
      "num_rows": 187213
    },
    "configs": [
      {
        "dataset": "ibm/duorc",
        "config": "ParaphraseRC",
        "num_bytes_original_files": 37709127,
        "num_bytes_parquet_files": 37709127,
        "num_bytes_memory": 704394283,
        "num_rows": 100972,
        "num_columns": 7
      },
      {
        "dataset": "ibm/duorc",
        "config": "SelfRC",
        "num_bytes_original_files": 21001846,
        "num_bytes_parquet_files": 21001846,
        "num_bytes_memory": 356348071,
        "num_rows": 86241,
        "num_columns": 7
      }
    ],
    "splits": [
      {
        "dataset": "ibm/duorc",
        "config": "ParaphraseRC",
        "split": "train",
        "num_bytes_parquet_files": 26005668,
        "num_bytes_memory": 494389683,
        "num_rows": 69524,
        "num_columns": 7
      },
      {
        "dataset": "ibm/duorc",
        "config": "ParaphraseRC",
        "split": "validation",
        "num_bytes_parquet_files": 5566868,
        "num_bytes_memory": 106733319,
        "num_rows": 15591,
        "num_columns": 7
      },
      {
        "dataset": "ibm/duorc",
        "config": "ParaphraseRC",
        "split": "test",
        "num_bytes_parquet_files": 6136591,
        "num_bytes_memory": 103271281,
        "num_rows": 15857,
        "num_columns": 7
      },
      {
        "dataset": "ibm/duorc",
        "config": "SelfRC",
        "split": "train",
        "num_bytes_parquet_files": 14851720,
        "num_bytes_memory": 248966361,
        "num_rows": 60721,
        "num_columns": 7
      },
      {
        "dataset": "ibm/duorc",
        "config": "SelfRC",
        "split": "validation",
        "num_bytes_parquet_files": 3114390,
        "num_bytes_memory": 56359392,
        "num_rows": 12961,
        "num_columns": 7
      },
      {
        "dataset": "ibm/duorc",
        "config": "SelfRC",
        "split": "test",
        "num_bytes_parquet_files": 3035736,
        "num_bytes_memory": 51022318,
        "num_rows": 12559,
        "num_columns": 7
      }
    ]
  },
  "pending": [],
  "failed": [],
  "partial": false
}
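To make the byte counts easier to read, the response can be condensed into one line per subset. The following is a minimal sketch: the helper names `format_bytes`, `describe`, and `summarize_size` are hypothetical, the sizes are formatted with decimal units (1 MB = 1,000,000 bytes), and `SAMPLE` is a trimmed copy of the ibm/duorc response shown above.

```python
# Trimmed copy of the /size response for ibm/duorc (splits omitted).
SAMPLE = {
    "size": {
        "dataset": {
            "dataset": "ibm/duorc",
            "num_bytes_parquet_files": 58710973,
            "num_bytes_memory": 1060742354,
            "num_rows": 187213,
        },
        "configs": [
            {"config": "ParaphraseRC", "num_bytes_parquet_files": 37709127,
             "num_bytes_memory": 704394283, "num_rows": 100972},
            {"config": "SelfRC", "num_bytes_parquet_files": 21001846,
             "num_bytes_memory": 356348071, "num_rows": 86241},
        ],
    },
    "partial": False,
}


def format_bytes(num_bytes: int) -> str:
    """Format a byte count with decimal (SI) units, capped at TB."""
    size = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if size < 1000 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1000


def describe(name: str, entry: dict) -> str:
    """Summarize one entry (the whole dataset or one subset) on a single line."""
    parquet = format_bytes(entry["num_bytes_parquet_files"])
    memory = format_bytes(entry["num_bytes_memory"])
    return f"{name}: {entry['num_rows']:,} rows, {parquet} Parquet, {memory} in memory"


def summarize_size(payload: dict) -> list:
    """Return one line for the dataset total, then one indented line per subset."""
    size = payload["size"]
    lines = [describe(size["dataset"]["dataset"], size["dataset"])]
    lines += [describe("  " + config["config"], config) for config in size["configs"]]
    return lines


for line in summarize_size(SAMPLE):
    print(line)
# ibm/duorc: 187,213 rows, 58.71 MB Parquet, 1.06 GB in memory
#   ParaphraseRC: 100,972 rows, 37.71 MB Parquet, 704.39 MB in memory
#   SelfRC: 86,241 rows, 21.00 MB Parquet, 356.35 MB in memory
```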
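If the response has partial: true, the actual size of the dataset could not be determined because the dataset is too large. In that case, the reported numbers of rows and bytes may be lower than the actual values.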
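One way to account for this in client code is to treat the reported row count as a lower bound whenever the flag is set. The following is a minimal sketch; the helper name `describe_row_count` and the wording of its output are assumptions made for illustration, and the second payload is invented to show the partial case.

```python
def describe_row_count(payload: dict) -> str:
    """Report the row count as exact, or as a lower bound if the response is partial."""
    num_rows = payload["size"]["dataset"]["num_rows"]
    # Treat a missing "partial" key as False (an assumption of this sketch).
    if payload.get("partial", False):
        return f"at least {num_rows:,} rows (partial response)"
    return f"{num_rows:,} rows"


# Values taken from the ibm/duorc response above.
complete = {"size": {"dataset": {"num_rows": 187213}}, "partial": False}
# Invented payload, only to illustrate the partial case.
truncated = {"size": {"dataset": {"num_rows": 5000000}}, "partial": True}

print(describe_row_count(complete))   # 187,213 rows
print(describe_row_count(truncated))  # at least 5,000,000 rows (partial response)
```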