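Process

🤗 Datasets provides many tools for modifying the structure and content of a dataset. These tools are important for tidying up a dataset, creating additional columns, converting between features and formats, and much more.

This guide will show you how to:

Reorder rows and split the dataset.
Rename and remove columns, and other common column operations.
Apply processing functions to each example in a dataset.
Concatenate datasets.
Apply a custom formatting transform.
Save and export processed datasets.

For more details specific to processing other dataset modalities, take a look at the process audio dataset guide, the process image dataset guide, or the process text dataset guide.

The examples in this guide use the MRPC dataset, but feel free to load any dataset of your choice and follow along!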

>>> from datasets import load_dataset
>>> dataset = load_dataset("nyu-mll/glue", "mrpc", split="train")

All processing methods in this guide return a new Dataset object. Modification is not done in-place. Be careful not to overwrite your previous dataset!

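Sort, shuffle, select, split, and shard

There are several functions for rearranging the structure of a dataset. These functions are useful for selecting only the rows you want, creating train and test splits, and sharding very large datasets into smaller chunks.

Sort

Use sort() to sort the rows by the numerical values of a column. The provided column must be NumPy compatible.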

>>> dataset["label"][:10]
[1, 0, 1, 0, 1, 1, 0, 1, 0, 0]
>>> sorted_dataset = dataset.sort("label")
>>> sorted_dataset["label"][:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> sorted_dataset["label"][-10:]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

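Under the hood, this creates a list of indices that is sorted according to the values of the column. This indices mapping is then used to access the right rows in the underlying Arrow table.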
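The idea of an indices mapping can be sketched in plain Python. This is only an illustration of the concept, not the library's implementation: the list labels stands in for a column of the underlying table, and the names indices and sorted_view are made up for this sketch.

```python
# Plain-Python sketch of an indices mapping (not the library's implementation).
labels = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]  # stands in for a column of the stored table

# Stable argsort: row positions ordered by their label value.
indices = sorted(range(len(labels)), key=lambda i: labels[i])

# Reading row k of the "sorted" view means reading row indices[k] of the stored data.
sorted_view = [labels[i] for i in indices]

print(indices)      # [1, 3, 6, 8, 9, 0, 2, 4, 5, 7]
print(sorted_view)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The stored data is never reordered; only the list of positions changes, which is why the operation is cheap to create but makes later reads non-contiguous.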

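Shuffle

The shuffle() function randomly rearranges the rows of the dataset. If you want more control over the algorithm used to shuffle the dataset, specify the generator parameter in this function to use a different numpy.random.Generator.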

>>> shuffled_dataset = sorted_dataset.shuffle(seed=42)
>>> shuffled_dataset["label"][:10]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 0]

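Shuffling takes the list of indices [0:len(my_dataset)] and shuffles it to create an indices mapping. However, as soon as your Dataset has an indices mapping, reads can become up to 10x slower. This is because there is an extra step to get the row index to read from the indices mapping and, more importantly, you are no longer reading contiguous chunks of data. To restore the speed, rewrite the entire dataset to disk with Dataset.flatten_indices(), which removes the indices mapping. Alternatively, you can switch to an IterableDataset and leverage its fast approximate shuffling with IterableDataset.shuffle():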

>>> iterable_dataset = dataset.to_iterable_dataset(num_shards=128)
>>> shuffled_iterable_dataset = iterable_dataset.shuffle(seed=42, buffer_size=1000)
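To give an intuition for what approximate shuffling with a buffer means, here is a plain-Python sketch of a generic shuffle buffer. It is an assumption about the general technique, not the library's code: buffer_shuffle is a hypothetical helper, and the real IterableDataset.shuffle() also handles shards, which this sketch ignores.

```python
import random

def buffer_shuffle(iterable, buffer_size, seed):
    """Yield items in approximately shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer first; nothing is yielded yet.
            buffer.append(item)
            continue
        # Yield a random buffered item and put the incoming item in its slot.
        j = rng.randrange(buffer_size)
        yield buffer[j]
        buffer[j] = item
    # The stream is exhausted: flush what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffer_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

Only buffer_size items are held in memory at once, so an item can never move far ahead of its original position. That is what makes the shuffle approximate, and also what makes it fast on streamed data.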

Select and filter

There are two options for filtering rows in a dataset: select() and filter().

select() returns rows according to a list of indices.

>>> small_dataset = dataset.select([0, 10, 20, 30, 40, 50])
>>> len(small_dataset)
6

filter() returns rows that match a specified condition.

>>> start_with_ar = dataset.filter(lambda example: example["sentence1"].startswith("Ar"))
>>> len(start_with_ar)
6
>>> start_with_ar["sentence1"]
['Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
'Arison said Mann may have been one of the pioneers of the world music movement and he had a deep love of Brazilian music .',
'Arts helped coach the youth on an eighth-grade football team at Lombardi Middle School in Green Bay .',
'Around 9 : 00 a.m. EDT ( 1300 GMT ) , the euro was at $ 1.1566 against the dollar , up 0.07 percent on the day .',
"Arguing that the case was an isolated example , Canada has threatened a trade backlash if Tokyo 's ban is not justified on scientific grounds .",
'Artists are worried the plan would harm those who need help most - performers who have a difficult time lining up shows .'
]

filter() can also filter by index if you set with_indices=True.

>>> even_dataset = dataset.filter(lambda example, idx: idx % 2 == 0, with_indices=True)
>>> len(even_dataset)
1834
>>> len(dataset) / 2
1834.0

Unless the list of indices to keep is contiguous, these methods also create an indices mapping under the hood.

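Split

The train_test_split() function creates train and test splits if your dataset doesn't already have them. It lets you adjust the relative proportion or the absolute number of samples in each split. In the example below, the test_size parameter creates a test split that is 10% of the original dataset.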

>>> dataset.train_test_split(test_size=0.1)
{'train': Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 3301),
'test': Dataset(schema: {'sentence1': 'string', 'sentence2': 'string', 'label': 'int64', 'idx': 'int32'}, num_rows: 367)}
>>> 0.1 * len(dataset)
366.8

The splits are shuffled by default, but you can set shuffle=False to prevent shuffling.
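The split sizes reported for a 10% test split of 3668 rows (367 test rows and 3301 train rows) are consistent with rounding the test size up and giving the remainder to the train split. The arithmetic below is a sketch under that assumed rounding rule, not a statement of the library's internals.

```python
import math

n_rows = 3668      # number of rows in the MRPC train split
test_size = 0.1    # fraction requested for the test split

# Assumed rule: round the test split up, the rest goes to train.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_test, n_train)  # 367 3301
```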

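Shard

🤗 Datasets supports sharding to divide a very large dataset into a predefined number of chunks. Specify the num_shards parameter in shard() to set the number of shards to split the dataset into. You also need to provide the shard you want returned with the index parameter.

For example, the stanfordnlp/imdb dataset has 25000 examples.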

>>> from datasets import load_dataset
>>> dataset = load_dataset("stanfordnlp/imdb", split="train")
>>> print(dataset)
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

After sharding the dataset into four chunks, the first shard has only 6250 examples.

>>> dataset.shard(num_shards=4, index=0)
Dataset({
    features: ['text', 'label'],
    num_rows: 6250
})
>>> print(25000/4)
6250.0
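When the number of rows is not divisible by the number of shards, the shards cannot all have the same size. The sketch below shows one common way to compute contiguous shard boundaries, where the first few shards each receive one extra row. It is an illustration only: contiguous_shard_bounds is a hypothetical helper, and whether shard() splits contiguously by default depends on the library version.

```python
def contiguous_shard_bounds(num_rows, num_shards, index):
    """Return the [start, end) row range of shard `index` for a contiguous split."""
    div, mod = divmod(num_rows, num_shards)
    # The first `mod` shards each take one extra row.
    start = div * index + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return start, end

print(contiguous_shard_bounds(25000, 4, 0))                   # (0, 6250)
print([contiguous_shard_bounds(10, 3, i) for i in range(3)])  # [(0, 4), (4, 7), (7, 10)]
```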

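Rename, remove, cast, and flatten

The following functions let you modify the columns of a dataset. These functions are useful for renaming or removing columns, changing columns to a new set of features, and flattening nested column structures.

Rename

Use rename_column() when you need to rename a column in your dataset. Features associated with the original column are actually moved under the new column name, instead of just replacing the original column in-place.

Provide rename_column() with the name of the original column and the new column name.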

>>> dataset
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
>>> dataset = dataset.rename_column("sentence1", "sentenceA")
>>> dataset = dataset.rename_column("sentence2", "sentenceB")
>>> dataset
Dataset({
    features: ['sentenceA', 'sentenceB', 'label', 'idx'],
    num_rows: 3668
})

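Remove

When you need to remove one or more columns, provide the name of the column to remove to the remove_columns() function. Remove more than one column by providing a list of column names.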

>>> dataset = dataset.remove_columns("label")
>>> dataset
Dataset({
    features: ['sentence1', 'sentence2', 'idx'],
    num_rows: 3668
})
>>> dataset = dataset.remove_columns(["sentence1", "sentence2"])
>>> dataset
Dataset({
    features: ['idx'],
    num_rows: 3668
})

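Conversely, select_columns() selects one or more columns to keep and removes the rest. This function takes either a single column name or a list of column names.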

>>> dataset
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
>>> dataset = dataset.select_columns(['sentence1', 'sentence2', 'idx'])
>>> dataset
Dataset({
    features: ['sentence1', 'sentence2', 'idx'],
    num_rows: 3668
})
>>> dataset = dataset.select_columns('idx')
>>> dataset
Dataset({
    features: ['idx'],
    num_rows: 3668
})

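Cast

The cast() function transforms the feature type of one or more columns. This function accepts your new Features as its argument. The example below demonstrates how to change the ClassLabel and Value features.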

>>> dataset.features
{'sentence1': Value('string'),
'sentence2': Value('string'),
'label': ClassLabel(names=['not_equivalent', 'equivalent']),
'idx': Value('int32')}

>>> from datasets import ClassLabel, Value
>>> new_features = dataset.features.copy()
>>> new_features["label"] = ClassLabel(names=["negative", "positive"])
>>> new_features["idx"] = Value("int64")
>>> dataset = dataset.cast(new_features)
>>> dataset.features
{'sentence1': Value('string'),
'sentence2': Value('string'),
'label': ClassLabel(names=['negative', 'positive']),
'idx': Value('int64')}

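Casting only works if the original feature type and the new feature type are compatible. For example, you can cast a column with the feature type Value("int32") to Value("bool") if the original column only contains ones and zeros.

Use the cast_column() function to change the feature type of a single column. Pass the column name and its new feature type as arguments.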

>>> dataset.features
{'audio': Audio(sampling_rate=44100, mono=True)}

>>> from datasets import Audio
>>> dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
>>> dataset.features
{'audio': Audio(sampling_rate=16000, mono=True)}

Flatten

Sometimes a column can be a nested structure of several types. Take a look at the nested structure below from the SQuAD dataset:

>>> from datasets import load_dataset
>>> dataset = load_dataset("rajpurkar/squad", split="train")
>>> dataset.features
{'id': Value('string'),
 'title': Value('string'),
 'context': Value('string'),
 'question': Value('string'),
 'answers': {'text': List(Value('string')),
  'answer_start': List(Value('int32'))}}

The answers field contains two subfields: text and answer_start. Use the flatten() function to extract the subfields into their own separate columns.

>>> flat_dataset = dataset.flatten()
>>> flat_dataset
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
    num_rows: 87599
})

Notice how the subfields are now their own independent columns: answers.text and answers.answer_start.
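The naming scheme of the flattened columns (parent and child joined by a dot) can be reproduced with a small recursive function. This is a plain-Python sketch of the idea on a single SQuAD-like row; flatten_columns is a hypothetical helper, not the library's flatten().

```python
def flatten_columns(row, prefix=""):
    """Turn nested dicts into a flat dict whose keys are dotted paths."""
    flat = {}
    for name, value in row.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            # Recurse into the nested structure, extending the dotted prefix.
            flat.update(flatten_columns(value, prefix=f"{key}."))
        else:
            flat[key] = value
    return flat

row = {
    "id": "0001",
    "question": "Who?",
    "answers": {"text": ["Someone"], "answer_start": [42]},
}
flat_row = flatten_columns(row)
print(list(flat_row))  # ['id', 'question', 'answers.text', 'answers.answer_start']
```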

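Map

Some of the more powerful applications of 🤗 Datasets come from using the map() function. The primary purpose of map() is to speed up processing functions. It lets you apply a processing function to each example in a dataset, independently or in batches. This function can even create new rows and columns.

In the following example, prefix each sentence1 value in the dataset with 'My sentence: '.

Start by creating a function that adds 'My sentence: ' to the beginning of each sentence. The function needs to accept and output a dict.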

>>> def add_prefix(example):
...     example["sentence1"] = 'My sentence: ' + example["sentence1"]
...     return example

Now use map() to apply the add_prefix function to the entire dataset.

>>> updated_dataset = small_dataset.map(add_prefix)
>>> updated_dataset["sentence1"][:5]
['My sentence: Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
"My sentence: Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
'My sentence: They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added .',
'My sentence: Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 .',
]

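Let's take a look at another example, except this time, you'll remove a column with map(). When you remove a column, it is only removed after the example has been provided to the mapped function. This allows the mapped function to use the content of the column before it is removed.

Specify the column to remove with the remove_columns parameter in map().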

>>> updated_dataset = dataset.map(lambda example: {"new_sentence": example["sentence1"]}, remove_columns=["sentence1"])
>>> updated_dataset.column_names
['sentence2', 'label', 'idx', 'new_sentence']
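The order of operations described here (the function sees the full row first, and the columns are dropped afterwards) can be modeled in a few lines of plain Python. This is a conceptual sketch only: map_rows is a hypothetical helper operating on a list of dicts, not how map() is implemented.

```python
def map_rows(rows, function, remove_columns=()):
    """Apply `function` to each row, then drop `remove_columns` and merge the result."""
    out = []
    for row in rows:
        update = function(row)  # the function still sees every column
        kept = {k: v for k, v in row.items() if k not in remove_columns}
        out.append({**kept, **update})  # returned keys are added or overwritten
    return out

rows = [{"sentence1": "Hello .", "sentence2": "Hi .", "label": 1, "idx": 0}]
updated = map_rows(
    rows,
    lambda example: {"new_sentence": example["sentence1"]},
    remove_columns=["sentence1"],
)
print(list(updated[0]))  # ['sentence2', 'label', 'idx', 'new_sentence']
```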

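🤗 Datasets also has a remove_columns() function, which is faster because it doesn't copy the data of the remaining columns.

You can also use map() with indices if you set with_indices=True. The example below adds the index to the beginning of each sentence.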

>>> updated_dataset = dataset.map(lambda example, idx: {"sentence2": f"{idx}: " + example["sentence2"]}, with_indices=True)
>>> updated_dataset["sentence2"][:5]
['0: Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 "1: Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 .",
 "2: On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale .",
 '3: Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 .',
 '4: PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .'
]

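Multiprocessing

Multiprocessing significantly speeds up processing by parallelizing processes on the CPU. Set the num_proc parameter in map() to set the number of processes to use.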

>>> updated_dataset = dataset.map(lambda example, idx: {"sentence2": f"{idx}: " + example["sentence2"]}, with_indices=True, num_proc=4)

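map() can also work with the rank of the process if you set with_rank=True. This is analogous to the with_indices parameter. The rank argument of the mapped function goes after the index argument, if the latter is present.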

>>> import torch
>>> from multiprocess import set_start_method
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> from datasets import load_dataset
>>>
>>> # Get an example dataset
>>> dataset = load_dataset("fka/awesome-chatgpt-prompts", split="train")
>>>
>>> # Get an example model and its tokenizer
>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B-Chat").eval()
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
>>>
>>> def gpu_computation(batch, rank):
...     # Move the model on the right GPU if it's not there already
...     device = f"cuda:{(rank or 0) % torch.cuda.device_count()}"
...     model.to(device)
...
...     # Your big GPU call goes here, for example:
...     chats = [[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": prompt}
...     ] for prompt in batch["prompt"]]
...     texts = [tokenizer.apply_chat_template(
...         chat,
...         tokenize=False,
...         add_generation_prompt=True
...     ) for chat in chats]
...     model_inputs = tokenizer(texts, padding=True, return_tensors="pt").to(device)
...     with torch.no_grad():
...         outputs = model.generate(**model_inputs, max_new_tokens=512)
...     batch["output"] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
...     return batch
>>>
>>> if __name__ == "__main__":
...     set_start_method("spawn")
...     updated_dataset = dataset.map(
...         gpu_computation,
...         batched=True,
...         batch_size=16,
...         with_rank=True,
...         num_proc=torch.cuda.device_count(),  # one process per GPU
...     )

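The main use case for rank is to parallelize computation across several GPUs. This requires setting multiprocess.set_start_method("spawn"). If you don't, you'll receive the following CUDA error.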

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method.

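Batch processing

The map() function supports working with batches of examples. Operate on batches by setting batched=True. The default batch size is 1000, but you can adjust it with the batch_size parameter. Batch processing enables interesting applications such as splitting long sentences into shorter chunks and data augmentation.

Split long examples

When examples are too long, you may want to split them into several smaller chunks. Begin by creating a function that:

Splits the sentence1 field into chunks of 50 characters.
Stacks all the chunks together to create the new dataset.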

>>> def chunk_examples(examples):
...     chunks = []
...     for sentence in examples["sentence1"]:
...         chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
...     return {"chunks": chunks}

Apply the function with map().

>>> chunked_dataset = dataset.map(chunk_examples, batched=True, remove_columns=dataset.column_names)
>>> chunked_dataset[:10]
{'chunks': ['Amrozi accused his brother , whom he called " the ',
            'witness " , of deliberately distorting his evidenc',
            'e .',
            "Yucaipa owned Dominick 's before selling the chain",
            ' to Safeway in 1998 for $ 2.5 billion .',
            'They had published an advertisement on the Interne',
            't on June 10 , offering the cargo for sale , he ad',
            'ded .',
            'Around 0335 GMT , Tab shares were up 19 cents , or',
            ' 4.4 % , at A $ 4.56 , having earlier set a record']}

Notice how the sentences are now split into shorter chunks, and there are more rows in the dataset.

>>> dataset
Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})
>>> chunked_dataset
Dataset({
    features: ['chunks'],
    num_rows: 10470
})
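The reason the row count can grow is that a batched function receives a dict of lists and may return lists of a different length. Calling chunk_examples directly on a small hand-made batch makes this visible; the batch below is invented for illustration.

```python
def chunk_examples(examples):
    # Split every sentence of the batch into pieces of at most 50 characters.
    chunks = []
    for sentence in examples["sentence1"]:
        chunks += [sentence[i:i + 50] for i in range(0, len(sentence), 50)]
    return {"chunks": chunks}

# A toy batch of two "sentences": one of 120 characters, one of 30.
batch = {"sentence1": ["a" * 120, "b" * 30]}
result = chunk_examples(batch)

print(len(batch["sentence1"]))            # 2 input rows
print([len(c) for c in result["chunks"]]) # [50, 50, 20, 30], so 4 output rows
```

Because the output has a different number of rows than the input, the original columns no longer line up with the new one, which is why the map() call above removes them with remove_columns.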

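Data augmentation

The map() function can also be used for data augmentation. The following example generates additional words for a masked token in a sentence.

Load and use the RoBERTa model in 🤗 Transformers' FillMaskPipeline.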

>>> from random import randint
>>> from transformers import pipeline

>>> fillmask = pipeline("fill-mask", model="roberta-base")
>>> mask_token = fillmask.tokenizer.mask_token
>>> smaller_dataset = dataset.filter(lambda e, i: i<100, with_indices=True)

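Create a function to randomly select a word to mask in the sentence. The function should also return the original sentence and the top three replacements generated by RoBERTa.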

>>> def augment_data(examples):
...     outputs = []
...     for sentence in examples["sentence1"]:
...         words = sentence.split(' ')
...         K = randint(1, len(words)-1)
...         masked_sentence = " ".join(words[:K]  + [mask_token] + words[K+1:])
...         predictions = fillmask(masked_sentence)
...         augmented_sequences = [predictions[i]["sequence"] for i in range(3)]
...         outputs += [sentence] + augmented_sequences
...
...     return {"data": outputs}

Use map() to apply the function over the whole dataset.

>>> augmented_dataset = smaller_dataset.map(augment_data, batched=True, remove_columns=dataset.column_names, batch_size=8)
>>> augmented_dataset[:9]["data"]
['Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'Amrozi accused his brother, whom he called " the witness ", of deliberately withholding his evidence.',
 'Amrozi accused his brother, whom he called " the witness ", of deliberately suppressing his evidence.',
 'Amrozi accused his brother, whom he called " the witness ", of deliberately destroying his evidence.',
 "Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion .",
 'Yucaipa owned Dominick Stores before selling the chain to Safeway in 1998 for $ 2.5 billion.',
 "Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $ 2.5 billion.",
 'Yucaipa owned Dominick Pizza before selling the chain to Safeway in 1998 for $ 2.5 billion.'
]

For each original sentence, RoBERTa augmented a random word with three alternatives. The original word distorting is supplemented by withholding, suppressing, and destroying.

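Asynchronous processing

Asynchronous functions are useful for calling API endpoints in parallel, for example to download content such as images or to call a model endpoint.

You can define an asynchronous function using the async and await keywords. Here is an example function that calls a chat model from Hugging Face: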

>>> import aiohttp
>>> import asyncio
>>> from huggingface_hub import get_token
>>> sem = asyncio.Semaphore(20)  # max number of simultaneous queries
>>> async def query_model(model, prompt):
...     api_url = f"https://api-inference.huggingface.co/models/{model}/v1/chat/completions"
...     headers = {"Authorization": f"Bearer {get_token()}", "Content-Type": "application/json"}
...     json = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 20, "seed": 42}
...     async with sem, aiohttp.ClientSession() as session, session.post(api_url, headers=headers, json=json) as response:
...         output = await response.json()
...         return {"Output": output["choices"][0]["message"]["content"]}

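Asynchronous functions run in parallel, which accelerates processing a lot. The same code would take much longer if run sequentially, because it does nothing while waiting for the model response. It is generally recommended to use async / await when your function has to wait for a response from an API, or when it downloads data and that can take some time.

Note the presence of a Semaphore: it sets the maximum number of queries that can run in parallel. Using a Semaphore is recommended when calling APIs to avoid rate limit errors.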

Let's use query_model to call the microsoft/Phi-3-mini-4k-instruct model and ask it to return the main topic of each math problem in the Maxwell-Jia/AIME_2024 dataset.

>>> from datasets import load_dataset
>>> ds = load_dataset("Maxwell-Jia/AIME_2024", split="train")
>>> model = "microsoft/Phi-3-mini-4k-instruct"
>>> prompt = 'What is this text mainly about ? Here is the text:\n\n```\n{Problem}\n```\n\nReply using one or two words max, e.g. "The main topic is Linear Algebra".'
>>> async def get_topic(example):
...     return await query_model(model, prompt.format(Problem=example['Problem']))
>>> ds = ds.map(get_topic)
>>> ds[0]
{'ID': '2024-II-4',
 'Problem': 'Let $x,y$ and $z$ be positive real numbers that...',
 'Solution': 'Denote $\\log_2(x) = a$, $\\log_2(y) = b$, and...',
 'Answer': 33,
 'Output': 'The main topic is Logarithms.'}

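Here, Dataset.map() runs many get_topic calls asynchronously, so it doesn't have to wait for every single model response, which would take a lot of time if done sequentially.

By default, Dataset.map() runs up to one thousand map functions in parallel, so don't forget to set the maximum number of API calls that can run in parallel with a Semaphore. Otherwise the model could return rate limit errors or become overloaded. For advanced use cases, you can change the maximum number of parallel queries in datasets.config.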
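The effect of a Semaphore on concurrency can be checked without any network access. The sketch below uses only the standard library and replaces the real API request with asyncio.sleep; fake_api_call and the counters are hypothetical names used for this illustration.

```python
import asyncio

async def main(num_calls=10, limit=3):
    sem = asyncio.Semaphore(limit)  # at most `limit` calls inside the block at once
    running = 0
    peak = 0

    async def fake_api_call(i):
        nonlocal running, peak
        async with sem:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for waiting on an API response
            running -= 1
            return i * 2

    # All calls are scheduled together, but the semaphore throttles them.
    results = await asyncio.gather(*(fake_api_call(i) for i in range(num_calls)))
    return results, peak

results, peak = asyncio.run(main())
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
print(peak)     # 3
```

Even though ten calls are launched at once, no more than three are ever in flight, and asyncio.gather still returns the results in input order.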

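Process multiple splits

Many datasets have splits that can be processed simultaneously with DatasetDict.map(). For example, tokenize the sentence1 field in the train and test splits (this assumes a tokenizer has already been loaded):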

>>> from datasets import load_dataset

>>> # load all the splits
>>> dataset = load_dataset('nyu-mll/glue', 'mrpc')
>>> encoded_dataset = dataset.map(lambda examples: tokenizer(examples["sentence1"]), batched=True)
>>> encoded_dataset["train"][0]
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': [  101,  7277,  2180,  5303,  4806,  1117,  1711,   117,  2292, 1119,  1270,   107,  1103,  7737,   107,   117,  1104,  9938, 4267, 12223, 21811,  1117,  2554,   119,   102],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}

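Distributed usage

When you use map() in a distributed setting, you should also use torch.distributed.barrier. This ensures the main process performs the mapping while the other processes load the results, thereby avoiding duplicate work.

The following example shows how you can use torch.distributed.barrier to synchronize the processes: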

>>> from datasets import Dataset
>>> import torch.distributed

>>> dataset1 = Dataset.from_dict({"a": [0, 1, 2]})

>>> if training_args.local_rank > 0:
...     print("Waiting for main process to perform the mapping")
...     torch.distributed.barrier()

>>> dataset2 = dataset1.map(lambda x: {"a": x["a"] + 1})

>>> if training_args.local_rank == 0:
...     print("Loading results from main process")
...     torch.distributed.barrier()

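Batch

The batch() method lets you group samples from the dataset into batches. This is particularly useful when you want to create batches of data for training or evaluation, especially when working with deep learning models.

Here's an example of how to use the batch() method: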

>>> from datasets import load_dataset
>>> dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
>>> batched_dataset = dataset.batch(batch_size=4)
>>> batched_dataset[0]
{'text': ['the rock is destined to be the 21st century\'s new " conan " and that he\'s going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
        'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .',
        'effective but too-tepid biopic',
        'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'],
'label': [1, 1, 1, 1]}

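The batch() method accepts the following parameters:

batch_size (int): The number of samples in each batch.
drop_last_batch (bool, defaults to False): Whether to drop the last incomplete batch if the dataset size is not divisible by the batch size.
num_proc (int, optional, defaults to None): The number of processes to use for multiprocessing. If None, no multiprocessing is used. This can significantly speed up batching for large datasets.

Note that Dataset.batch() returns a new Dataset where each item is a batch of multiple samples from the original dataset. If you want to process data in batches, you should use a batched map() directly, which applies the function to batches but produces an unbatched output dataset.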
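The shape of a batched dataset, and the effect of drop_last_batch, can be illustrated with a plain-Python sketch. batch_rows is a hypothetical helper working on a list of dicts; it mirrors the behavior described above but is not the library's implementation.

```python
def batch_rows(rows, batch_size, drop_last_batch=False):
    """Group rows into batches, each batch being a dict of lists."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if drop_last_batch and len(chunk) < batch_size:
            break  # skip the trailing incomplete batch
        batches.append({key: [row[key] for row in chunk] for key in chunk[0]})
    return batches

rows = [{"text": f"review {i}", "label": i % 2} for i in range(5)]

print(len(batch_rows(rows, batch_size=2)))                        # 3
print(len(batch_rows(rows, batch_size=2, drop_last_batch=True)))  # 2
print(batch_rows(rows, batch_size=2)[0])
# {'text': ['review 0', 'review 1'], 'label': [0, 1]}
```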

Concatenate

Separate datasets can be concatenated if they share the same column types. Concatenate datasets with concatenate_datasets().

>>> from datasets import concatenate_datasets, load_dataset

>>> stories = load_dataset("ajibawa-2023/General-Stories-Collection", split="train")
>>> stories = stories.remove_columns([col for col in stories.column_names if col != "text"])  # only keep the 'text' column
>>> wiki = load_dataset("wikimedia/wikipedia", "20220301.en", split="train")
>>> wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])  # only keep the 'text' column

>>> assert stories.features.type == wiki.features.type
>>> bert_dataset = concatenate_datasets([stories, wiki])

You can also concatenate two datasets horizontally by setting axis=1, as long as the datasets have the same number of rows.

>>> from datasets import Dataset
>>> stories_ids = Dataset.from_dict({"ids": list(range(len(stories)))})
>>> stories_with_ids = concatenate_datasets([stories, stories_ids], axis=1)

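Interleave

You can also mix several datasets together by taking alternating examples from each one to create a new dataset. This is known as interleaving, which is enabled by the interleave_datasets() function. Both interleave_datasets() and concatenate_datasets() work with regular Dataset and IterableDataset objects. Refer to the Stream guide for an example of how to interleave IterableDataset objects.

You can define sampling probabilities for each of the original datasets to specify how to interleave the datasets. In this case, the new dataset is constructed by getting examples one by one from a random dataset until one of the datasets runs out of samples.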

>>> from datasets import Dataset, interleave_datasets
>>> seed = 42
>>> probabilities = [0.3, 0.5, 0.2]
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=probabilities, seed=seed)
>>> dataset["a"]
[10, 11, 20, 12, 0, 21, 13]

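You can also specify the stopping_strategy. The default strategy, first_exhausted, is a subsampling strategy: dataset construction stops as soon as one of the datasets runs out of samples. You can specify stopping_strategy="all_exhausted" to execute an oversampling strategy. In this case, dataset construction stops as soon as every sample in every dataset has been added at least once. In practice, this means that if a dataset is exhausted, it returns to the beginning of that dataset until the stop criterion has been reached. Note that if no sampling probabilities are specified, the new dataset will have max_length_datasets*nb_dataset samples.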

>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 20]
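For the case without sampling probabilities, the two stopping strategies can be reproduced with a short round-robin sketch in plain Python. interleave_round_robin is a hypothetical helper working on lists; it only models the behavior described above and is not the library's implementation.

```python
from itertools import cycle, islice

def interleave_round_robin(datasets, stopping_strategy="first_exhausted"):
    """Alternate between datasets; exhausted ones restart from their beginning."""
    if stopping_strategy == "first_exhausted":
        length = min(len(d) for d in datasets)  # subsampling
    elif stopping_strategy == "all_exhausted":
        length = max(len(d) for d in datasets)  # oversampling
    else:
        raise ValueError(f"unknown stopping_strategy: {stopping_strategy}")
    # Take `length` items from each dataset, cycling back to the start if needed.
    columns = [list(islice(cycle(d), length)) for d in datasets]
    return [item for row in zip(*columns) for item in row]

d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22]
print(interleave_round_robin([d1, d2, d3]))
# [0, 10, 20, 1, 11, 21, 2, 12, 22]
print(interleave_round_robin([d1, d2, d3], stopping_strategy="all_exhausted"))
# [0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 20]
```

With all_exhausted, the result has 4 * 3 = 12 samples, which matches the max_length_datasets*nb_dataset rule.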

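Format

The with_format() function changes the format of the dataset's columns to be compatible with common data formats. Specify the output you'd like with the type parameter. This function replaces any previously specified format, and formatting is applied on-the-fly, only when examples are accessed.

For example, create PyTorch tensors by setting type="torch":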

>>> dataset = dataset.with_format(type="torch")

The set_format() function also changes the format of a column, except it runs in-place.

>>> dataset.set_format(type="torch")

If you need to reset the dataset to its original format, set the format to None (or use reset_format()).

>>> dataset.format
{'type': 'torch', 'format_kwargs': {}, 'columns': [...], 'output_all_columns': False}
>>> dataset = dataset.with_format(None)
>>> dataset.format
{'type': None, 'format_kwargs': {}, 'columns': [...], 'output_all_columns': False}

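Tensor formats

Several tensor or array formats are supported. It is generally recommended to use these formats instead of manually converting the outputs of a dataset to tensors or arrays, to avoid unnecessary data copies and accelerate data loading.

Here is the list of supported tensor or array formats:

NumPy: format name is "numpy", for more information see Using Datasets with NumPy
PyTorch: format name is "torch", for more information see Using Datasets with PyTorch
TensorFlow: format name is "tensorflow", for more information see Using Datasets with TensorFlow
JAX: format name is "jax", for more information see Using Datasets with JAX

Check out the Using Datasets with TensorFlow guide for more details on how to efficiently create a TensorFlow dataset.

When a dataset is formatted in a tensor or array format, all the data are formatted as tensors or arrays (except unsupported types, such as strings for PyTorch).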

>>> ds = Dataset.from_dict({"text": ["foo", "bar"], "tokens": [[0, 1, 2], [3, 4, 5]]})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'text': 'foo', 'tokens': tensor([0, 1, 2])}
>>> ds[:2]
{'text': ['foo', 'bar'],
 'tokens': tensor([[0, 1, 2],
         [3, 4, 5]])}

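Tabular formats

You can use a dataframe or table format to optimize data loading and data processing, since they generally offer zero-copy operations and transforms written in low-level languages.

Here is the list of supported dataframe or table formats:

Pandas: format name is "pandas", for more information see Using Datasets with Pandas
Polars: format name is "polars", for more information see Using Datasets with Polars
PyArrow: format name is "arrow", for more information see Using Datasets with PyArrow

When a dataset is formatted in a dataframe or table format, every dataset row or batch of rows is formatted as a dataframe or table, and dataset columns are formatted as a series or array.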

>>> ds = Dataset.from_dict({"text": ["foo", "bar"], "label": [0, 1]})
>>> ds = ds.with_format("pandas")
>>> ds[:2]
  text  label
0  foo      0
1  bar      1

These formats make it possible to iterate over the data faster by avoiding data copies, and also enable faster data processing in map() or filter().

>>> ds = ds.map(lambda df: df.assign(upper_text=df.text.str.upper()), batched=True)
>>> ds[:2]
  text  label upper_text
0  foo      0        FOO
1  bar      1        BAR

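Custom format transform

The with_transform() function applies a custom formatting transform on-the-fly. This function replaces any previously specified format. For example, you can use this function to tokenize and pad tokens on-the-fly. Tokenization is only applied when examples are accessed.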

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> def encode(batch):
...     return tokenizer(batch["sentence1"], batch["sentence2"], padding="longest", truncation=True, max_length=512, return_tensors="pt")
>>> dataset = dataset.with_transform(encode)
>>> dataset.format
{'type': 'custom', 'format_kwargs': {'transform': <function __main__.encode(batch)>}, 'columns': ['idx', 'label', 'sentence1', 'sentence2'], 'output_all_columns': False}

There is also a set_transform() function, which does the same thing but runs in-place.
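What "on-the-fly" means can be modeled with a small wrapper that stores the raw rows and only calls the transform when an item is read. This is a conceptual sketch in plain Python: TransformedView is a hypothetical class, and it assumes, as in the encode example above, that the transform receives a batch (a dict of lists) and returns one.

```python
class TransformedView:
    """Toy stand-in for a dataset with an on-the-fly transform (hypothetical class)."""

    def __init__(self, rows, transform):
        self.rows = rows            # raw data is kept untouched
        self.transform = transform

    def __getitem__(self, index):
        row = self.rows[index]
        batch = {key: [value] for key, value in row.items()}  # wrap as a batch of one
        formatted = self.transform(batch)
        return {key: values[0] for key, values in formatted.items()}

calls = []

def encode(batch):
    calls.append(len(batch["sentence1"]))  # record every invocation
    return {"num_chars": [len(s) for s in batch["sentence1"]]}

view = TransformedView([{"sentence1": "hello"}, {"sentence1": "hi"}], encode)
calls_before_access = len(calls)  # 0: creating the view does no work
first = view[0]

print(calls_before_access, first, calls)  # 0 {'num_chars': 5} [1]
```

Nothing is computed or stored ahead of time, so changing the transform is free, at the price of recomputing it on every access.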

You can also use the with_transform() function for custom decoding of Features.

The example below uses the pydub package as an alternative to torchcodec decoding.

>>> import numpy as np
>>> from pydub import AudioSegment

>>> audio_dataset_amr = Dataset.from_dict({"audio": ["audio_samples/audio.amr"]})

>>> def decode_audio_with_pydub(batch, sampling_rate=16_000):
...     def pydub_decode_file(audio_path):
...         sound = AudioSegment.from_file(audio_path)
...         if sound.frame_rate != sampling_rate:
...             sound = sound.set_frame_rate(sampling_rate)
...         channel_sounds = sound.split_to_mono()
...         samples = [s.get_array_of_samples() for s in channel_sounds]
...         fp_arr = np.array(samples).T.astype(np.float32)
...         fp_arr /= np.iinfo(samples[0].typecode).max
...         return fp_arr
...
...     batch["audio"] = [pydub_decode_file(audio_path) for audio_path in batch["audio"]]
...     return batch

>>> audio_dataset_amr.set_transform(decode_audio_with_pydub)

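Save

Once your dataset is ready, you can save it as a Hugging Face Dataset in Parquet format and reuse it later with load_dataset().

Save your dataset by passing the name of the target Hugging Face dataset repository to push_to_hub().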

encoded_dataset.push_to_hub("username/my_dataset")

You can use multiple processes to upload in parallel, which is especially useful if you want to speed up the process.

dataset.push_to_hub("username/my_dataset", num_proc=8)

Use the load_dataset() function to reload the dataset (in streaming mode or not).

from datasets import load_dataset
reloaded_dataset = load_dataset("username/my_dataset", streaming=True)

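Alternatively, you can save it locally in Arrow format on disk. Compared to Parquet, Arrow is uncompressed, which makes it faster to reload and well suited for local use on disk and for ephemeral caching. However, since it is larger and carries less metadata, it is slower to upload, download, and query than Parquet, and less suited for long-term storage.

Use the save_to_disk() and load_from_disk() functions to save your dataset to disk and reload it.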

>>> encoded_dataset.save_to_disk("path/of/my/dataset/directory")
>>> # later
>>> from datasets import load_from_disk
>>> reloaded_dataset = load_from_disk("path/of/my/dataset/directory")

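Export

🤗 Datasets supports exporting as well, so you can work with your dataset in other applications. The following table shows the currently supported file formats you can export to:

File type	Export method
CSV	Dataset.to_csv()
JSON	Dataset.to_json()
Parquet	Dataset.to_parquet()
SQL	Dataset.to_sql()
In-memory Python object	Dataset.to_pandas(), Dataset.to_polars() or Dataset.to_dict()

For example, export your dataset to a CSV file like this: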

>>> encoded_dataset.to_csv("path/of/my/dataset.csv")
