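Quickstart

In this quickstart, you'll learn how to use the dataset viewer's REST API to:

Check whether a dataset on the Hub is functional.
Return the subsets and splits of a dataset.
Preview the first 100 rows of a dataset.
Download slices of rows of a dataset.
Search for a word in a dataset.
Filter rows based on a query string.
Access the dataset as parquet files.
Get the dataset size (in number of rows or bytes).
Get statistics about the dataset.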

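API endpoints

Each feature is served through an endpoint, summarized in the table below:

Endpoint	Method	Description	Query parameters
/is-valid	GET	Check whether a specific dataset is valid.	`dataset`: name of the dataset
/splits	GET	Get the list of subsets and splits of a dataset.	`dataset`: name of the dataset
/first-rows	GET	Get the first rows of a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split
/rows	GET	Get a slice of rows of a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split; `offset`: offset of the slice; `length`: length of the slice (maximum 100)
/search	GET	Search text in a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split; `query`: text to search for
/filter	GET	Filter rows in a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split; `where`: filter query; `orderby`: order-by clause; `offset`: offset of the slice; `length`: length of the slice (maximum 100)
/parquet	GET	Get the list of parquet files of a dataset.	`dataset`: name of the dataset
/size	GET	Get the size of a dataset.	`dataset`: name of the dataset
/statistics	GET	Get statistics about a dataset split.	`dataset`: name of the dataset; `config`: name of the config; `split`: name of the split
/croissant	GET	Get Croissant metadata about a dataset.	`dataset`: name of the dataset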

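No installation or setup is required to use the dataset viewer API.

If you don't have a Hugging Face account yet, sign up for one! Although you can use the dataset viewer API without a Hugging Face account, you won't be able to access gated datasets such as CommonVoice and ImageNet unless you provide a user token, which you can find in your user settings.

Feel free to try out the API in Postman, ReDoc, or RapidAPI. This quickstart shows you how to query the endpoints programmatically.

The base URL of the REST API is: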

https://datasets-server.huggingface.co

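Private and gated datasets

For private and gated datasets, you'll need to provide your user token in the headers of your query. Otherwise, you'll get an error message asking you to retry with authentication.

The dataset viewer supports private datasets owned by a PRO user or an Enterprise Hub organization.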

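As a minimal sketch of an authenticated query, the Python below uses only the standard library. `API_TOKEN` is a placeholder, and `build_request` and `fetch_json` are hypothetical helper names introduced for illustration. The sketch assumes the token is sent as an `Authorization: Bearer` header, and the `/is-valid` endpoint is used only as an example target.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"
API_TOKEN = "hf_xxx"  # placeholder: replace with your own user token


def build_request(endpoint, token=None, **params):
    """Build a GET request for an endpoint, adding the token header if given."""
    query_string = urlencode(params, safe="/")
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return urllib.request.Request(f"{BASE_URL}{endpoint}?{query_string}", headers=headers)


def fetch_json(request):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_request(
    "/is-valid",
    token=API_TOKEN,
    dataset="cornell-movie-review-data/rotten_tomatoes",
)
# data = fetch_json(request)  # uncomment to actually query the API
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for error status codes, so with this sketch an authentication failure would surface as an exception rather than as a returned JSON object.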

If you try to access a gated dataset without providing your user token, you'll see the following error:

print(data)
{'error': 'The dataset does not exist, or is not accessible without authentication (private or gated). Please check the spelling of the dataset name or retry with authentication.'}

Check dataset validity

To check whether a specific dataset, such as Rotten Tomatoes, is valid, use the /is-valid endpoint:

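One way to issue this query from Python is sketched below, using only the standard library. The helper names (`is_valid_url`, `query`, `available_features`) are hypothetical, and the dataset identifier is the one that appears in the sample responses of this quickstart.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def is_valid_url(dataset):
    """Return the /is-valid URL for a dataset name."""
    query_string = urlencode({"dataset": dataset}, safe="/")
    return f"{BASE_URL}/is-valid?{query_string}"


def query(url):
    """GET a URL and decode the JSON response (performs a network call)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def available_features(validity):
    """List the features reported as true in an /is-valid response."""
    return [name for name, enabled in validity.items() if enabled]


url = is_valid_url("cornell-movie-review-data/rotten_tomatoes")
# validity = query(url)                 # uncomment to call the API
# print(available_features(validity))
```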

This returns whether the dataset provides a preview (see /first-rows), the viewer (see /rows), search (see /search), filtering (see /filter), and statistics (see /statistics):

{ "preview": true, "viewer": true, "search": true, "filter": true, "statistics": true }

List configurations and splits

The /splits endpoint returns a JSON list of the splits in a dataset:

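The sketch below builds the /splits URL and shows one way to read the result. The helper names are hypothetical, the URL can be fetched with any HTTP client (for example `urllib.request.urlopen`), and the `sample` dictionary is a copy of the `splits` field of the sample response in this section.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def splits_url(dataset):
    """Return the /splits URL for a dataset name."""
    query_string = urlencode({"dataset": dataset}, safe="/")
    return f"{BASE_URL}/splits?{query_string}"


def config_split_pairs(response):
    """Extract (config, split) pairs from a decoded /splits response."""
    return [(item["config"], item["split"]) for item in response["splits"]]


DATASET = "cornell-movie-review-data/rotten_tomatoes"
url = splits_url(DATASET)

# Copy of the "splits" field from the sample response in this section.
sample = {
    "splits": [
        {"dataset": DATASET, "config": "default", "split": "train"},
        {"dataset": DATASET, "config": "default", "split": "validation"},
        {"dataset": DATASET, "config": "default", "split": "test"},
    ]
}
pairs = config_split_pairs(sample)
```

The resulting pairs are the `config` and `split` values that the /first-rows, /rows, and /search endpoints expect.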

This returns the available subsets and splits in the dataset:

{
  "splits": [
    { "dataset": "cornell-movie-review-data/rotten_tomatoes", "config": "default", "split": "train" },
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "validation"
    },
    { "dataset": "cornell-movie-review-data/rotten_tomatoes", "config": "default", "split": "test" }
  ],
  "pending": [],
  "failed": []
}

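Preview a dataset

The /first-rows endpoint returns a JSON list of the first 100 rows of a dataset. It also returns the types of the data features ("columns" data types). You should specify the dataset name, the subset name (you can find it with the /splits endpoint), and the split name of the dataset you'd like to preview: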

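A short sketch of how the three parameters could be combined into a /first-rows URL follows. `first_rows_url` and `column_names` are hypothetical helpers, and the field names they read (`features`, `name`) are taken from the sample response in this section.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def first_rows_url(dataset, config, split):
    """Return the /first-rows URL for one split of a dataset."""
    params = {"dataset": dataset, "config": config, "split": split}
    query_string = urlencode(params, safe="/")
    return f"{BASE_URL}/first-rows?{query_string}"


def column_names(response):
    """Return the column names listed in the "features" field of a response."""
    return [feature["name"] for feature in response["features"]]


url = first_rows_url("cornell-movie-review-data/rotten_tomatoes", "default", "train")
```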

This returns the first 100 rows of the dataset:

{
  "dataset": "cornell-movie-review-data/rotten_tomatoes",
  "config": "default",
  "split": "train",
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "names": ["neg", "pos"], "_type": "ClassLabel" }
    }
  ],
  "rows": [
    {
      "row_idx": 0,
      "row": {
        "text": "the rock is destined to be the 21st century's new \" conan \" and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 1,
      "row": {
        "text": "the gorgeously elaborate continuation of \" the lord of the rings \" trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ]
}

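Download slices of a dataset

The /rows endpoint returns a JSON list of a slice of rows of a dataset, starting at any given position (offset). It also returns the types of the data features ("columns" data types). You should specify the dataset name, the subset name (you can find it with the /splits endpoint), the split name, and the offset and length of the slice you'd like to download: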

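The sketch below shows one way to build /rows URLs and to walk through a whole split in slices of at most 100 rows. The helper names are hypothetical. The offset of 150 matches the first `row_idx` in the sample response of this section, while the length of 10 is an arbitrary choice for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"
MAX_LENGTH = 100  # maximum slice length stated for the /rows endpoint


def rows_url(dataset, config, split, offset, length=MAX_LENGTH):
    """Return the /rows URL for a slice, rejecting lengths outside 1..100."""
    if not 0 < length <= MAX_LENGTH:
        raise ValueError(f"length must be between 1 and {MAX_LENGTH}")
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "offset": offset,
        "length": length,
    }
    return f"{BASE_URL}/rows?{urlencode(params, safe='/')}"


def page_offsets(num_rows_total, length=MAX_LENGTH):
    """Yield (offset, length) pairs that together cover every row of a split."""
    for offset in range(0, num_rows_total, length):
        yield offset, min(length, num_rows_total - offset)


url = rows_url("cornell-movie-review-data/rotten_tomatoes", "default", "train", offset=150, length=10)
```

The `num_rows_total` field of a /rows response can be passed to `page_offsets` to plan the remaining requests.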

You can download slices of at most 100 rows at a time.

The response looks like this:

{
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "names": ["neg", "pos"], "_type": "ClassLabel" }
    }
  ],
  "rows": [
    {
      "row_idx": 150,
      "row": {
        "text": "enormously likable , partly because it is aware of its own grasp of the absurd .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 151,
      "row": {
        "text": "here's a british flick gleefully unconcerned with plausibility , yet just as determined to entertain you .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ],
  "num_rows_total": 8530,
  "num_rows_per_page": 100,
  "partial": false
}

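Search text in a dataset

The /search endpoint returns a JSON list of a slice of rows of a dataset that match a text query. The text is searched in the columns of type string, even if the values are nested in a dictionary. It also returns the types of the data features ("columns" data types). The response format is the same as for the /rows endpoint. You should specify the dataset name, the subset name (you can find it with the /splits endpoint), the split name, and the search query you'd like to find in the text columns: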

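A possible way to build the /search URL is sketched below. `search_url` is a hypothetical helper, and the query string `cat` is an assumption, chosen only because it is consistent with the sample rows shown in this section. `urlencode` takes care of escaping spaces and other special characters in the query.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def search_url(dataset, config, split, query, offset=None, length=None):
    """Return the /search URL; offset and length are added only when given."""
    params = {"dataset": dataset, "config": config, "split": split, "query": query}
    if offset is not None:
        params["offset"] = offset
    if length is not None:
        params["length"] = length
    return f"{BASE_URL}/search?{urlencode(params, safe='/')}"


DATASET = "cornell-movie-review-data/rotten_tomatoes"
url = search_url(DATASET, "default", "train", "cat")
phrase_url = search_url(DATASET, "default", "train", "take care", offset=0, length=10)
```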

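You can get slices of at most 100 rows at a time, and you can request other slices with the offset and length parameters, as for the /rows endpoint.

The response looks like this: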

{
  "features": [
    {
      "feature_idx": 0,
      "name": "text",
      "type": { "dtype": "string", "_type": "Value" }
    },
    {
      "feature_idx": 1,
      "name": "label",
      "type": { "dtype": "int64", "_type": "Value" }
    }
  ],
  "rows": [
    {
      "row_idx": 9,
      "row": {
        "text": "take care of my cat offers a refreshingly different slice of asian cinema .",
        "label": 1
      },
      "truncated_cells": []
    },
    {
      "row_idx": 472,
      "row": {
        "text": "[ \" take care of my cat \" ] is an honestly nice little film that takes us on an examination of young adult life in urban south korea through the hearts and minds of the five principals .",
        "label": 1
      },
      "truncated_cells": []
    },
    ...,
    ...
  ],
  "num_rows_total": 12,
  "num_rows_per_page": 100,
  "partial": false
}

Access Parquet files

The dataset viewer converts every dataset on the Hub to the Parquet format. The /parquet endpoint returns a JSON list of the Parquet URLs for a dataset:

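The following sketch builds the /parquet URL and groups the returned file URLs by subset and split. The helper names are hypothetical, and the `sample` dictionary reproduces a single entry from the sample response in this section.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def parquet_url(dataset):
    """Return the /parquet URL for a dataset name."""
    query_string = urlencode({"dataset": dataset}, safe="/")
    return f"{BASE_URL}/parquet?{query_string}"


def parquet_urls_by_split(response):
    """Group Parquet file URLs by (config, split) from a decoded response."""
    urls = {}
    for item in response["parquet_files"]:
        urls.setdefault((item["config"], item["split"]), []).append(item["url"])
    return urls


DATASET = "cornell-movie-review-data/rotten_tomatoes"
url = parquet_url(DATASET)

# One entry copied from the sample response in this section.
sample = {
    "parquet_files": [
        {
            "dataset": DATASET,
            "config": "default",
            "split": "test",
            "url": "https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet",
            "filename": "0000.parquet",
            "size": 92206,
        }
    ]
}
files = parquet_urls_by_split(sample)
```

Grouping into lists rather than single values is a design choice of this sketch: it does not assume that a split always maps to exactly one file.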

This returns a URL to the Parquet file for each split:

{
  "parquet_files": [
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "test",
      "url": "https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/test/0000.parquet",
      "filename": "0000.parquet",
      "size": 92206
    },
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "train",
      "url": "https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet",
      "filename": "0000.parquet",
      "size": 698845
    },
    {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "config": "default",
      "split": "validation",
      "url": "https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes/resolve/refs%2Fconvert%2Fparquet/default/validation/0000.parquet",
      "filename": "0000.parquet",
      "size": 90001
    }
  ],
  "pending": [],
  "failed": [],
  "partial": false
}

Get the size of the dataset

The /size endpoint returns a JSON with the size (number of rows and size in bytes) of the dataset, and of every subset and split:

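As a rough sketch, the /size URL can be built and the per-split row counts read as shown below. The helper names are hypothetical, and the `sample` dictionary keeps only the row counts from the sample response in this section.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def size_url(dataset):
    """Return the /size URL for a dataset name."""
    query_string = urlencode({"dataset": dataset}, safe="/")
    return f"{BASE_URL}/size?{query_string}"


def rows_per_split(response):
    """Map each split name to its row count from a decoded /size response."""
    return {item["split"]: item["num_rows"] for item in response["size"]["splits"]}


url = size_url("cornell-movie-review-data/rotten_tomatoes")

# Row counts copied from the sample response in this section.
sample = {
    "size": {
        "dataset": {"num_rows": 10662},
        "splits": [
            {"config": "default", "split": "train", "num_rows": 8530},
            {"config": "default", "split": "validation", "num_rows": 1066},
            {"config": "default", "split": "test", "num_rows": 1066},
        ],
    }
}
counts = rows_per_split(sample)
total = sum(counts.values())  # equals the dataset-level num_rows in the sample
```

Keying by split name alone is only safe for a dataset with a single subset, as in this sample; a dataset with several subsets would need `(config, split)` keys.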

This returns the size of the dataset, and of every subset and split:

{
  "size": {
    "dataset": {
      "dataset": "cornell-movie-review-data/rotten_tomatoes",
      "num_bytes_original_files": 487770,
      "num_bytes_parquet_files": 881052,
      "num_bytes_memory": 1345449,
      "num_rows": 10662
    },
    "configs": [
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "num_bytes_original_files": 487770,
        "num_bytes_parquet_files": 881052,
        "num_bytes_memory": 1345449,
        "num_rows": 10662,
        "num_columns": 2
      }
    ],
    "splits": [
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "train",
        "num_bytes_parquet_files": 698845,
        "num_bytes_memory": 1074806,
        "num_rows": 8530,
        "num_columns": 2
      },
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "validation",
        "num_bytes_parquet_files": 90001,
        "num_bytes_memory": 134675,
        "num_rows": 1066,
        "num_columns": 2
      },
      {
        "dataset": "cornell-movie-review-data/rotten_tomatoes",
        "config": "default",
        "split": "test",
        "num_bytes_parquet_files": 92206,
        "num_bytes_memory": 135968,
        "num_rows": 1066,
        "num_columns": 2
      }
    ]
  },
  "pending": [],
  "failed": [],
  "partial": false
}
