Dataset viewer documentation

Get the number of rows and the size in bytes

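This guide shows you how to use the dataset viewer's /size endpoint to retrieve a dataset's size programmatically. You can also try it out with ReDoc.

The /size endpoint accepts the dataset name as its query parameter: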

Python
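import os
import requests

# Assumption: the access token is stored in the HF_TOKEN environment variable.
API_TOKEN = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/size?dataset=ibm/duorc"

def query():
    response = requests.get(API_URL, headers=headers)
    response.raise_for_status()  # raise on HTTP errors before decoding the body
    return response.json()

data = query()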
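
To query a dataset other than ibm/duorc, the URL can be built from the dataset name instead of being hard-coded. The following is a minimal sketch using only the standard library; build_size_url and BASE_URL are hypothetical names introduced for illustration and are not part of any Hugging Face library.

```python
from urllib.parse import urlencode

# Endpoint taken from the example above.
BASE_URL = "https://datasets-server.huggingface.co/size"


def build_size_url(dataset: str) -> str:
    """Return the /size URL for a dataset name such as "ibm/duorc"."""
    # safe="/" keeps the namespace separator readable;
    # other special characters are still percent-encoded.
    query_string = urlencode({"dataset": dataset}, safe="/")
    return f"{BASE_URL}?{query_string}"


print(build_size_url("ibm/duorc"))
```

The resulting string can be passed to requests.get in place of API_URL.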

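The endpoint response is a JSON object containing the size of the dataset, as well as the size of each of its subsets and splits. It provides the number of rows, the number of columns (where applicable), and the size in bytes for the different forms of the data: original files, size in memory (RAM), and auto-converted Parquet files. For example, the ibm/duorc dataset has 187,213 rows across all its subsets and splits, for a total of 58,710,973 bytes (about 59 MB) of original files and about 1.06 GB in memory.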

{
   "size":{
      "dataset":{
         "dataset":"ibm/duorc",
         "num_bytes_original_files":58710973,
         "num_bytes_parquet_files":58710973,
         "num_bytes_memory":1060742354,
         "num_rows":187213
      },
      "configs":[
         {
            "dataset":"ibm/duorc",
            "config":"ParaphraseRC",
            "num_bytes_original_files":37709127,
            "num_bytes_parquet_files":37709127,
            "num_bytes_memory":704394283,
            "num_rows":100972,
            "num_columns":7
         },
         {
            "dataset":"ibm/duorc",
            "config":"SelfRC",
            "num_bytes_original_files":21001846,
            "num_bytes_parquet_files":21001846,
            "num_bytes_memory":356348071,
            "num_rows":86241,
            "num_columns":7
         }
      ],
      "splits":[
         {
            "dataset":"ibm/duorc",
            "config":"ParaphraseRC",
            "split":"train",
            "num_bytes_parquet_files":26005668,
            "num_bytes_memory":494389683,
            "num_rows":69524,
            "num_columns":7
         },
         {
            "dataset":"ibm/duorc",
            "config":"ParaphraseRC",
            "split":"validation",
            "num_bytes_parquet_files":5566868,
            "num_bytes_memory":106733319,
            "num_rows":15591,
            "num_columns":7
         },
         {
            "dataset":"ibm/duorc",
            "config":"ParaphraseRC",
            "split":"test",
            "num_bytes_parquet_files":6136591,
            "num_bytes_memory":103271281,
            "num_rows":15857,
            "num_columns":7
         },
         {
            "dataset":"ibm/duorc",
            "config":"SelfRC",
            "split":"train",
            "num_bytes_parquet_files":14851720,
            "num_bytes_memory":248966361,
            "num_rows":60721,
            "num_columns":7
         },
         {
            "dataset":"ibm/duorc",
            "config":"SelfRC",
            "split":"validation",
            "num_bytes_parquet_files":3114390,
            "num_bytes_memory":56359392,
            "num_rows":12961,
            "num_columns":7
         },
         {
            "dataset":"ibm/duorc",
            "config":"SelfRC",
            "split":"test",
            "num_bytes_parquet_files":3035736,
            "num_bytes_memory":51022318,
            "num_rows":12559,
            "num_columns":7
         }
      ]
   },
   "pending":[
      
   ],
   "failed":[
      
   ],
   "partial":false
}
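
As a rough illustration of reading this response, the sketch below sums the split-level row counts for each subset and renders byte counts in a human-readable form. The helper names format_bytes and rows_per_config are hypothetical, the use of decimal units (1 MB = 1,000,000 bytes) is an assumption, and the sample dictionary is a trimmed copy of the response above that keeps only the fields the code reads.

```python
def format_bytes(num_bytes: int) -> str:
    """Render a byte count with decimal units (1 MB = 1,000,000 bytes)."""
    value = float(num_bytes)
    for unit in ("B", "kB", "MB", "GB"):
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} TB"


def rows_per_config(payload: dict) -> dict:
    """Sum the num_rows of every split, grouped by config (subset)."""
    totals = {}
    for split in payload["size"]["splits"]:
        config = split["config"]
        totals[config] = totals.get(config, 0) + split["num_rows"]
    return totals


# Trimmed copy of the ibm/duorc response shown above.
sample = {
    "size": {
        "dataset": {"num_bytes_original_files": 58710973, "num_rows": 187213},
        "splits": [
            {"config": "ParaphraseRC", "split": "train", "num_rows": 69524},
            {"config": "ParaphraseRC", "split": "validation", "num_rows": 15591},
            {"config": "ParaphraseRC", "split": "test", "num_rows": 15857},
            {"config": "SelfRC", "split": "train", "num_rows": 60721},
            {"config": "SelfRC", "split": "validation", "num_rows": 12961},
            {"config": "SelfRC", "split": "test", "num_rows": 12559},
        ],
    }
}

totals = rows_per_config(sample)
print(totals)
print(format_bytes(sample["size"]["dataset"]["num_bytes_original_files"]))
```

With the full response, the same functions can be called on the data returned by query(); the per-subset totals should match the num_rows values listed under configs.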

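If the response contains partial: true, the dataset is too large for its actual size to be determined.

In that case, the reported row counts and byte sizes may be lower than the actual values.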
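
One way to account for the partial flag when reporting a row count is sketched below. describe_row_count is a hypothetical helper, and treating a missing partial key as false is an assumption made for this example.

```python
def describe_row_count(payload: dict) -> str:
    """Describe the dataset-level row count, flagging partial results."""
    num_rows = payload["size"]["dataset"]["num_rows"]
    # Assumption: a missing "partial" key is treated as a complete result.
    if payload.get("partial", False):
        # The reported count is only a lower bound for the real one.
        return f"at least {num_rows:,} rows (partial)"
    return f"{num_rows:,} rows"


complete = {"size": {"dataset": {"num_rows": 187213}}, "partial": False}
print(describe_row_count(complete))
```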
