PyArrow

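Arrow is a columnar data format and toolbox for fast data interchange and in-memory analytics. Because PyArrow supports fsspec for reading and writing remote data, you can use Hugging Face paths (hf://) to read and write data on the Hub. This is especially useful for Parquet, the most common file format on Hugging Face. Parquet is particularly efficient thanks to its structure, typing, metadata, and compression.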

Load a Table

You can load data from local files or from remote storage such as Hugging Face Datasets. PyArrow supports many formats, including CSV, JSON, and, most importantly, Parquet:

>>> import pyarrow.parquet as pq
>>> table = pq.read_table("path/to/data.parquet")

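To load a file from Hugging Face, the path must start with hf://. For example, the path to the stanfordnlp/imdb dataset repository is hf://datasets/stanfordnlp/imdb. Datasets on Hugging Face contain multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient and to make sharing data across data analysis languages easy. Here is how to load the file plain_text/train-00000-of-00001.parquet as a PyArrow table (requires pyarrow>=21.0):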

>>> import pyarrow.parquet as pq
>>> table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> table
pyarrow.Table
text: string
label: int64
----
text: [["I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it (... 1542 chars omitted)", ...],...,[..., "The story centers around Barry McKenzie who must go to England if he wishes to claim his inheritan (... 221 chars omitted)"]]
label: [[0,0,0,0,0,...,0,0,0,0,0],...,[1,1,1,1,1,...,1,1,1,1,1]]

If you don't want to load the full Parquet data, you can read the Parquet metadata or load the file one row group at a time:

>>> import pyarrow.parquet as pq
>>> pf = pq.ParquetFile("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
>>> pf.metadata
<pyarrow._parquet.FileMetaData object at 0x1171b4090>
  created_by: parquet-cpp-arrow version 12.0.0
  num_columns: 2
  num_rows: 25000
  num_row_groups: 25
  format_version: 2.6
  serialized_size: 62036
>>> for i in range(pf.num_row_groups):
...     table = pf.read_row_group(i)
...     ...

For more information about Hugging Face paths and how they are implemented, see the client library's documentation on the HfFileSystem.

Save a Table

You can save a PyArrow Table to a local file or directly to Hugging Face using pyarrow.parquet.write_table.

To save the table to Hugging Face, you first need to log in with your Hugging Face account, for example using:

hf auth login

Then you can create a dataset repository, for example using:

from huggingface_hub import HfApi

HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset")

Finally, you can use Hugging Face paths in PyArrow:

import pyarrow.parquet as pq

pq.write_table(table, "hf://datasets/username/my_dataset/imdb.parquet", use_content_defined_chunking=True)

# or write in separate files if the dataset has train/validation/test splits
pq.write_table(table_train, "hf://datasets/username/my_dataset/train.parquet", use_content_defined_chunking=True)
pq.write_table(table_valid, "hf://datasets/username/my_dataset/validation.parquet", use_content_defined_chunking=True)
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True)

We use use_content_defined_chunking=True to enable faster uploads and downloads on Hugging Face thanks to Xet deduplication (requires pyarrow>=21.0).

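Content-defined chunking (CDC) makes the Parquet writer split data pages so that duplicate data is chunked and compressed identically. Without CDC, pages are split at arbitrary positions, and compression then prevents duplicate data from being detected. With CDC, uploading and downloading Parquet files on Hugging Face is faster, because duplicate data is uploaded or downloaded only once.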
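
The idea behind content-defined chunking can be illustrated on raw bytes. The sketch below is a toy illustration only and is not the algorithm used by the Parquet writer or by Xet: it assumes a CRC32 over a 16-byte sliding window as the boundary test, and the window size, mask, and sample data are arbitrary choices. It compares how many chunks survive a small insertion under fixed-size chunking and under content-defined chunking.

```python
import random
import zlib

WINDOW = 16   # bytes of trailing context used to decide on a boundary (arbitrary)
MASK = 0xFF   # cut when the low 8 bits of the hash are zero (~256-byte chunks)


def cdc_chunks(data):
    """Cut after every position where a hash of the trailing WINDOW bytes matches."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if zlib.crc32(window) & MASK == 0:
            chunks.append(data[start: i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data, size=256):
    """Cut at fixed offsets, regardless of content."""
    return [data[i: i + size] for i in range(0, len(data), size)]


# Pseudo-random sample data, plus a copy with three bytes inserted at the start.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"NEW" + original


def shared(chunker):
    """Count the distinct chunks that appear in both versions."""
    return len(set(chunker(original)) & set(chunker(edited)))


print("fixed-size chunks shared:", shared(fixed_chunks))
print("content-defined chunks shared:", shared(cdc_chunks))
```

With fixed offsets, the insertion shifts every later chunk, so none of them match the original. With content-defined boundaries, the cut points move together with the data, so only the chunks near the edit change and the rest can be deduplicated.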

For more information about Xet, see the Hugging Face documentation on Xet.

Use Images

You can load a folder with a metadata file that contains a field for the names or paths of the images, structured like this:

Example 1:            Example 2:
folder/               folder/
├── metadata.parquet  ├── metadata.parquet
├── img000.png        └── images
├── img001.png            ├── img000.png
...                       ...
└── imgNNN.png            └── imgNNN.png

You can iterate over the image paths like this:

from pathlib import Path
import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    image_path = folder_path / file_name
    ...

Since the dataset is in a supported structure (a metadata.parquet file with a file_name field), you can save it to Hugging Face, and the Dataset Viewer shows both the metadata and the images.

from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_image_dataset",
    repo_type="dataset",
)

Embed Images in a Parquet File

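PyArrow has a binary type that lets you store image bytes in an Arrow table. This makes it possible to save the dataset as a single Parquet file containing both the images (bytes and path) and the sample metadata: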

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the image bytes in Arrow
image_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
# append_column returns a new table, so reassign the result
table = table.append_column("image", image_array)

# (Optional) Set the HF Image type for the Dataset Viewer and the `datasets` library
features = {"image": {"_type": "Image"}}  # or using datasets.Features(...).to_dict()
# Arrow schema metadata values must be strings or bytes, so serialize the dict as JSON
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)

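Setting the Image type in the Arrow schema metadata lets other libraries and the Hugging Face Dataset Viewer know that "image" contains images and not just binary data.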

Use Audio

You can load a folder with a metadata file that contains a field for the names or paths of the audio files, structured like this:

Example 1:            Example 2:
folder/               folder/
├── metadata.parquet  ├── metadata.parquet
├── rec000.wav        └── audios
├── rec001.wav            ├── rec000.wav
...                       ...
└── recNNN.wav            └── recNNN.wav

You can iterate over the audio paths like this:

from pathlib import Path
import pyarrow.parquet as pq

folder_path = Path("path/to/folder")
table = pq.read_table(folder_path / "metadata.parquet")
for file_name in table["file_name"].to_pylist():
    audio_path = folder_path / file_name
    ...

Since the dataset is in a supported structure (a metadata.parquet file with a file_name field), you can save it to Hugging Face, and the Hub Dataset Viewer shows both the metadata and the audio.

from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path=folder_path,
    repo_id="username/my_audio_dataset",
    repo_type="dataset",
)

Embed Audio in a Parquet File

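PyArrow has a binary type that lets you store audio bytes in an Arrow table. This makes it possible to save the dataset as a single Parquet file containing both the audio (bytes and path) and the sample metadata: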

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Embed the audio bytes in Arrow
audio_array = pa.array([
    {
        "bytes": (folder_path / file_name).read_bytes(),
        "path": file_name,
    }
    for file_name in table["file_name"].to_pylist()
])
# append_column returns a new table, so reassign the result
table = table.append_column("audio", audio_array)

# (Optional) Set the HF Audio type for the Dataset Viewer and the `datasets` library
features = {"audio": {"_type": "Audio"}}  # or using datasets.Features(...).to_dict()
# Arrow schema metadata values must be strings or bytes, so serialize the dict as JSON
schema_metadata = {"huggingface": json.dumps({"dataset_info": {"features": features}})}
table = table.replace_schema_metadata(schema_metadata)

# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)

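Setting the Audio type in the Arrow schema metadata lets other libraries and the Hugging Face Dataset Viewer recognize that "audio" contains audio data and not just binary data.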
