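Summarization
Summarization creates a shorter version of a document or an article that captures all the important information. Along with translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:
- Extractive: extract the most relevant information from a document.
- Abstractive: generate new text that captures the most relevant information.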
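To make the extractive case concrete, here is a minimal sketch that scores sentences by average word frequency and returns the top one verbatim. The function name `extractive_summary` and the scoring rule are illustrative assumptions, not part of any library used in this guide; an abstractive model such as T5 generates new text instead of copying sentences.

```python
import re
from collections import Counter


def extractive_summary(document: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    # Count how often each word occurs in the whole document.
    word_counts = Counter(re.findall(r"[a-z']+", document.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the best, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = (
    "The bill lowers drug costs. "
    "The bill also lowers energy costs. "
    "Critics disagree."
)
summary = extractive_summary(doc, num_sentences=1)
print(summary)  # The bill lowers drug costs.
```

The output is always a sentence copied from the input, which is the defining property of extractive summarization.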
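This guide will show you how to fine-tune T5 on the California state bill subset of the BillSum dataset for abstractive summarization, and how to use the fine-tuned model for inference.
To see all architectures and checkpoints compatible with this task, we recommend checking the task page.
Before you begin, make sure you have all the necessary libraries installed: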
pip install transformers datasets evaluate rouge_score
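We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: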
>>> from huggingface_hub import notebook_login
>>> notebook_login()
Load BillSum dataset
Start by loading the smaller California state bill subset of the BillSum dataset from the 🤗 Datasets library:
>>> from datasets import load_dataset
>>> billsum = load_dataset("billsum", split="ca_test")
Split the dataset into a train and test set with the train_test_split method:
>>> billsum = billsum.train_test_split(test_size=0.2)
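As a rough picture of what a split with test_size=0.2 does, the sketch below shuffles row indices and sets aside 20% of them as the test set. `toy_train_test_split` is a hypothetical stand-in written for illustration; it is not the 🤗 Datasets implementation, whose shuffling and rounding details may differ.

```python
import math
import random


def toy_train_test_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices, then carve off a test_size fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    num_test = math.ceil(len(rows) * test_size)
    test_idx, train_idx = indices[:num_test], indices[num_test:]
    return {
        "train": [rows[i] for i in train_idx],
        "test": [rows[i] for i in test_idx],
    }


rows = [f"bill_{i}" for i in range(10)]
split = toy_train_test_split(rows, test_size=0.2)
print(len(split["train"]), len(split["test"]))  # 8 2
```

Every row lands in exactly one of the two sets, so the test set can be used to evaluate on examples the model never saw during training.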
Then take a look at an example:
>>> billsum["train"][0]
{'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the following 
circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject to Article 
9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.',
'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'}
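There are two fields that you'll want to use:
- text: the text of the bill, which will be the input to the model.
- summary: a condensed version of text, which will be the model target.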
Preprocess
The next step is to load a T5 tokenizer to process text and summary:
>>> from transformers import AutoTokenizer
>>> checkpoint = "google-t5/t5-small"
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
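The preprocessing function you want to create needs to:
- Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.
- Use the keyword text_target argument when tokenizing labels.
- Truncate sequences to be no longer than the maximum length set by the max_length parameter.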
>>> prefix = "summarize: "
>>> def preprocess_function(examples):
...     inputs = [prefix + doc for doc in examples["text"]]
...     model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
...     labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)
...     model_inputs["labels"] = labels["input_ids"]
...     return model_inputs
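The effect of the prefix and of truncation can be seen without a real tokenizer. The sketch below is a toy illustration that uses whitespace-separated words as a stand-in for subword tokens; `toy_preprocess` and its small length limits are assumptions made for readability, not what the T5 tokenizer produces.

```python
PREFIX = "summarize: "


def toy_preprocess(text, summary, max_input_len=8, max_target_len=4):
    """Prefix the input, then cut input and target to their maximum lengths."""
    # Whitespace-separated words stand in for subword tokens here.
    input_tokens = (PREFIX + text).split()[:max_input_len]
    label_tokens = summary.split()[:max_target_len]
    return {"input_tokens": input_tokens, "labels": label_tokens}


example = toy_preprocess(
    "The people of the State of California do enact as follows",
    "An act relating to public contracts",
)
print(example["input_tokens"])
# ['summarize:', 'The', 'people', 'of', 'the', 'State', 'of', 'California']
print(example["labels"])
# ['An', 'act', 'relating', 'to']
```

Note that the prefix counts toward the input length budget, and that anything past the limit is simply dropped.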
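To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once: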
>>> tokenized_billsum = billsum.map(preprocess_function, batched=True)
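With batched=True, the function receives a dictionary whose values are lists (one slice per column) rather than a single example. The sketch below mimics that calling convention with a hypothetical `toy_batched_map` helper; it is not the 🤗 Datasets implementation and, unlike the real method, it returns only the newly computed columns.

```python
from typing import Callable, Dict, List


def toy_batched_map(columns: Dict[str, List], fn: Callable, batch_size: int = 2) -> Dict[str, List]:
    """Hypothetical stand-in for map(..., batched=True) on a column-oriented table."""
    num_rows = len(next(iter(columns.values())))
    out: Dict[str, List] = {}
    for start in range(0, num_rows, batch_size):
        # Each call sees a dict of lists: one slice per column.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    return out


def count_words(batch):
    # Receives lists and returns lists of the same length.
    return {"num_words": [len(text.split()) for text in batch["text"]]}


table = {"text": ["a b c", "d e", "f"]}
result = toy_batched_map(table, count_words, batch_size=2)
print(result)  # {'num_words': [3, 2, 1]}
```

This is why preprocess_function above iterates over examples["text"]: in batched mode that value is a list of documents.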
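Now create a batch of examples using DataCollatorForSeq2Seq. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length: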
>>> from transformers import DataCollatorForSeq2Seq
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)
If you are using TensorFlow, create the collator with return_tensors="tf" instead:
>>> from transformers import DataCollatorForSeq2Seq
>>> data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")
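Dynamic padding can be sketched in plain Python. The helper below, `pad_batch`, is a simplified illustration and not the collator's actual code; it assumes a pad token id of 0 for the inputs and pads labels with -100, the value that compute_metrics later replaces before decoding.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sequence to the longest one in this batch only."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels use a separate pad value so padded positions can be ignored.
        batch["labels"].append(f["labels"] + [label_pad_token_id] * (max_label - len(f["labels"])))
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8]},
]
batch = pad_batch(features)
print(batch["input_ids"])  # [[5, 6, 7], [5, 0, 0]]
print(batch["labels"])     # [[8, 9], [8, -100]]
```

Because the target length is computed per batch, a batch of short documents is padded far less than it would be if every example were padded to the dataset-wide maximum.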
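Evaluate
Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the ROUGE metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):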
>>> import evaluate
>>> rouge = evaluate.load("rouge")
Then create a function that passes your predictions and labels to compute to calculate the ROUGE metric:
>>> import numpy as np
>>> def compute_metrics(eval_pred):
...     predictions, labels = eval_pred
...     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
...     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
...     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
...     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
...     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
...     result["gen_len"] = np.mean(prediction_lens)
...     return {k: round(v, 4) for k, v in result.items()}
Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.
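ROUGE scores are based on n-gram overlap between a prediction and a reference. As a rough intuition for ROUGE-1, the sketch below computes a unigram-overlap F1 score. `rouge1_f1` is a simplified illustration: it splits on whitespace and skips the stemming and tokenization that the rouge_score package applies, so its numbers will not match the library exactly.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection: each word counts at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # share of predicted words found in the reference
    recall = overlap / len(ref)      # share of reference words recovered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the bill lowers costs", "the bill lowers drug costs")
print(round(score, 4))  # 0.8889
```

Here all four predicted words appear in the reference (precision 1.0) and four of the five reference words are recovered (recall 0.8), which gives an F1 of 8/9.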
Train
You're ready to start training your model now! Load T5 with AutoModelForSeq2SeqLM:
>>> from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
>>> model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
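At this point, only three steps remain:
- Define your training hyperparameters in Seq2SeqTrainingArguments. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the Trainer will evaluate the ROUGE metric and save the training checkpoint.
- Pass the training arguments to Seq2SeqTrainer along with the model, dataset, tokenizer, data collator, and compute_metrics function.
- Call train() to fine-tune your model.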
>>> training_args = Seq2SeqTrainingArguments(
...     output_dir="my_awesome_billsum_model",
...     eval_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     weight_decay=0.01,
...     save_total_limit=3,
...     num_train_epochs=4,
...     predict_with_generate=True,
...     fp16=True,  # change to bf16=True for XPU
...     push_to_hub=True,
... )
>>> trainer = Seq2SeqTrainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_billsum["train"],
...     eval_dataset=tokenized_billsum["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
...     compute_metrics=compute_metrics,
... )
>>> trainer.train()
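To get a feel for how long a run with these hyperparameters is, you can estimate the number of optimizer steps from the batch size and epoch count. The sketch below assumes a single device and no gradient accumulation, and the training set size of 1,000 is a made-up figure for illustration; substitute len(tokenized_billsum["train"]) for a real estimate.

```python
import math


def total_training_steps(num_examples: int, batch_size: int, num_epochs: int) -> int:
    """Optimizer steps on one device, with no gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# Hypothetical training set size; batch size and epochs match the arguments above.
steps = total_training_steps(num_examples=1000, batch_size=16, num_epochs=4)
print(steps)  # 252
```

With eval_strategy="epoch", evaluation then runs once per epoch, that is, four times in this configuration.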
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()
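If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial first! To fine-tune a model in TensorFlow, start by setting up an optimizer: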
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Then you can load T5 with TFAutoModelForSeq2SeqLM:
>>> from transformers import TFAutoModelForSeq2SeqLM
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_billsum["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )
>>> tf_test_set = model.prepare_tf_dataset(
...     tokenized_billsum["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
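Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to: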
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer) # No loss argument!
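For a sequence-to-sequence model, the default loss is a token-level cross-entropy over the labels, with positions labeled -100 left out of the average. The sketch below illustrates that idea in plain Python; `masked_cross_entropy` is an illustrative helper written under that assumption, not the library's implementation.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded label position: contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(token_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in token_logits))
        total += log_z - token_logits[label]  # negative log-probability of the label
        count += 1
    return total / count if count else 0.0


# Two positions over a two-word vocabulary; the second position is padding.
logits = [[0.0, 0.0], [5.0, 1.0]]
labels = [1, -100]
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))  # 0.6931
```

Only the first position counts here: with two equally likely tokens, the loss is ln 2, and the confident but irrelevant prediction at the padded position is ignored.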
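The last two things to set up before you start training are to compute the ROUGE score from the predictions, and to provide a way to push your model to the Hub. Both are done by using Keras callbacks.
Pass your compute_metrics function to KerasMetricCallback: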
>>> from transformers.keras_callbacks import KerasMetricCallback
>>> metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)
Specify where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> push_to_hub_callback = PushToHubCallback(
...     output_dir="my_awesome_billsum_model",
...     tokenizer=tokenizer,
... )
Then bundle your callbacks together:
>>> callbacks = [metric_callback, push_to_hub_callback]
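Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model: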
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)
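Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to fine-tune a model for summarization, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Inference
Great, now that you've fine-tuned a model, you can use it for inference!
Come up with some text you'd like to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization, you should prefix your input as shown below: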
>>> text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."
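The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for summarization with your model, and pass your text to it: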
>>> from transformers import pipeline
>>> summarizer = pipeline("summarization", model="username/my_awesome_billsum_model")
>>> summarizer(text)
[{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]
You can also manually replicate the results of the pipeline if you'd like.
Tokenize the text and return the input_ids as PyTorch tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="pt").input_ids
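Use the generate() method to create the summary. For more details about the different text generation strategies and parameters for controlling generation, check out the Text Generation API.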
>>> from transformers import AutoModelForSeq2SeqLM
>>> model = AutoModelForSeq2SeqLM.from_pretrained("username/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
Decode the generated token ids back into text:
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
"the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share."
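With do_sample=False and a single beam, generation is greedy: at each step the highest-scoring next token is appended until an end-of-sequence token appears or max_new_tokens is reached. The sketch below shows that loop on a toy lookup-table "model". `greedy_decode`, `TOY_SCORES`, and `toy_model` are hypothetical names for illustration; a real model scores the whole vocabulary from the encoder output and the tokens decoded so far.

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_new_tokens):
    """Repeatedly append the highest-scoring next token."""
    sequence = [start_token]
    for _ in range(max_new_tokens):
        scores = next_token_scores(sequence)
        next_token = max(scores, key=scores.get)  # argmax, no sampling
        sequence.append(next_token)
        if next_token == eos_token:
            break
    return sequence[1:]  # drop the start token


# Toy scores that depend only on the previous token (an assumption for brevity).
TOY_SCORES = {
    "<pad>": {"the": 0.9, "act": 0.1},
    "the": {"act": 0.7, "the": 0.3},
    "act": {"</s>": 0.8, "the": 0.2},
}


def toy_model(sequence):
    return TOY_SCORES[sequence[-1]]


tokens = greedy_decode(toy_model, start_token="<pad>", eos_token="</s>", max_new_tokens=10)
print(tokens)  # ['the', 'act', '</s>']
```

Greedy decoding is deterministic, which is why repeated calls with do_sample=False return the same summary for the same input.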
Tokenize the text and return the input_ids as TensorFlow tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("username/my_awesome_billsum_model")
>>> inputs = tokenizer(text, return_tensors="tf").input_ids
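Use the generate() method (TFGenerationMixin.generate) to create the summary. For more details about the different text generation strategies and parameters for controlling generation, check out the Text Generation API.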
>>> from transformers import TFAutoModelForSeq2SeqLM
>>> model = TFAutoModelForSeq2SeqLM.from_pretrained("username/my_awesome_billsum_model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)
Decode the generated token ids back into text:
>>> tokenizer.decode(outputs[0], skip_special_tokens=True)
"the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share."