AutoTrain Documentation

Text Regression


Training a text regression model with AutoTrain is super easy! Get your data ready in the proper format and then, with just a few clicks, your state-of-the-art model will be ready to be used in production.

Data Format

Let's train a model to score movie reviews on a scale of 1-5. The data should be in the following CSV format:

text,target
"this movie is great",5
"this movie is bad",1
.
.
.

As you can see, the CSV file has two columns: one for the text and one for the label. The label can be any float or integer.

If your CSV file is huge, you can split it into multiple CSV files and upload them separately. Make sure the column names are the same in all of the CSV files.

One way to split a CSV file with pandas is shown below:

import pandas as pd

# Number of rows per output file
chunk_size = 1000

# Read the CSV in chunks and save each chunk to its own file
for i, chunk in enumerate(pd.read_csv('example.csv', chunksize=chunk_size), start=1):
    chunk.to_csv(f'chunk_{i}.csv', index=False)

Instead of CSV files, you can also use the JSONL format, which should look like this:

{"text": "this movie is great", "target": 5}
{"text": "this movie is bad", "target": 1}
.
.
.
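If your data starts out as a CSV, one way to produce the JSONL file is with pandas (a minimal sketch; example.csv and example.jsonl are placeholder file names):

import pandas as pd

# Read the CSV and write it back out as line-delimited JSON
df = pd.read_csv('example.csv')
df.to_json('example.jsonl', orient='records', lines=True, force_ascii=False)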

Your CSV dataset must contain two columns: text and target.
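Before uploading, a quick sanity check of the column names can save a failed run (a minimal sketch; example.csv is a placeholder):

import pandas as pd

# Verify that the required columns are present
df = pd.read_csv('example.csv')
missing = {'text', 'target'} - set(df.columns)
assert not missing, f"CSV is missing columns: {missing}"

If you use different column names, pass them via --text-column and --target-column instead of renaming your data.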

Parameters

❯ autotrain text-regression --help
usage: autotrain <command> [<args>] text-regression [-h] [--train] [--deploy] [--inference] [--username USERNAME]
                                                        [--backend {local-cli,spaces-a10gl,spaces-a10gs,spaces-a100,spaces-t4m,spaces-t4s,spaces-cpu,spaces-cpuf}]
                                                        [--token TOKEN] [--push-to-hub] --model MODEL --project-name PROJECT_NAME
                                                        [--data-path DATA_PATH] [--train-split TRAIN_SPLIT] [--valid-split VALID_SPLIT]
                                                        [--batch-size BATCH_SIZE] [--seed SEED] [--epochs EPOCHS]
                                                        [--gradient_accumulation GRADIENT_ACCUMULATION] [--disable_gradient_checkpointing] [--lr LR]
                                                        [--log {none,wandb,tensorboard}] [--text-column TEXT_COLUMN] [--target-column TARGET_COLUMN]
                                                        [--max-seq-length MAX_SEQ_LENGTH] [--warmup-ratio WARMUP_RATIO] [--optimizer OPTIMIZER]
                                                        [--scheduler SCHEDULER] [--weight-decay WEIGHT_DECAY] [--max-grad-norm MAX_GRAD_NORM]
                                                        [--logging-steps LOGGING_STEPS] [--eval-strategy {steps,epoch,no}]
                                                        [--save-total-limit SAVE_TOTAL_LIMIT]
                                                        [--auto-find-batch-size] [--mixed-precision {fp16,bf16,None}]

✨ Run AutoTrain Text Regression

options:
  -h, --help            show this help message and exit
  --train               Command to train the model
  --deploy              Command to deploy the model (limited availability)
  --inference           Command to run inference (limited availability)
  --username USERNAME   Hugging Face Hub Username
  --backend {local-cli,spaces-a10gl,spaces-a10gs,spaces-a100,spaces-t4m,spaces-t4s,spaces-cpu,spaces-cpuf}
                        Backend to use: default or spaces. Spaces backend requires push_to_hub & username. Advanced users only.
  --token TOKEN         Your Hugging Face API token. Token must have write access to the model hub.
  --push-to-hub         Push to hub after training will push the trained model to the Hugging Face model hub.
  --model MODEL         Base model to use for training
  --project-name PROJECT_NAME
                        Output directory / repo id for trained model (must be unique on hub)
  --data-path DATA_PATH
                        Train dataset to use. When using cli, this should be a directory path containing training and validation data in appropriate
                        formats
  --train-split TRAIN_SPLIT
                        Train dataset split to use
  --valid-split VALID_SPLIT
                        Validation dataset split to use
  --batch-size BATCH_SIZE
                        Training batch size to use
  --seed SEED           Random seed for reproducibility
  --epochs EPOCHS       Number of training epochs
  --gradient_accumulation GRADIENT_ACCUMULATION
                        Gradient accumulation steps
  --disable_gradient_checkpointing
                        Disable gradient checkpointing
  --lr LR               Learning rate
  --log {none,wandb,tensorboard}
                        Use experiment tracking
  --text-column TEXT_COLUMN
                        Specify the column name in the dataset that contains the text data. Useful for distinguishing between multiple text fields.
                        Default is 'text'.
  --target-column TARGET_COLUMN
                        Specify the column name that holds the target or label data for training. Helps in distinguishing different potential
                        outputs. Default is 'target'.
  --max-seq-length MAX_SEQ_LENGTH
                        Set the maximum sequence length (number of tokens) that the model should handle in a single input. Longer sequences are
                        truncated. Affects both memory usage and computational requirements. Default is 128 tokens.
  --warmup-ratio WARMUP_RATIO
                        Define the proportion of training to be dedicated to a linear warmup where learning rate gradually increases. This can help
                        in stabilizing the training process early on. Default ratio is 0.1.
  --optimizer OPTIMIZER
                        Choose the optimizer algorithm for training the model. Different optimizers can affect the training speed and model
                        performance. 'adamw_torch' is used by default.
  --scheduler SCHEDULER
                        Select the learning rate scheduler to adjust the learning rate based on the number of epochs. 'linear' decreases the
                        learning rate linearly from the initial lr set. Default is 'linear'. Try 'cosine' for a cosine annealing schedule.
  --weight-decay WEIGHT_DECAY
                        Set the weight decay rate to apply for regularization. Helps in preventing the model from overfitting by penalizing large
                        weights. Default is 0.0, meaning no weight decay is applied.
  --max-grad-norm MAX_GRAD_NORM
                        Specify the maximum norm of the gradients for gradient clipping. Gradient clipping is used to prevent the exploding gradient
                        problem in deep neural networks. Default is 1.0.
  --logging-steps LOGGING_STEPS
                        Determine how often to log training progress. Set this to the number of steps between each log output. -1 determines logging
                        steps automatically. Default is -1.
  --eval-strategy {steps,epoch,no}
                        Specify how often to evaluate the model performance. Options include 'no', 'steps', 'epoch'. 'epoch' evaluates at the end of
                        each training epoch by default.
  --save-total-limit SAVE_TOTAL_LIMIT
                        Limit the total number of model checkpoints to save. Helps manage disk space by retaining only the most recent checkpoints.
                        Default is to save only the latest one.
  --auto-find-batch-size
                        Enable automatic batch size determination based on your hardware capabilities. When set, it tries to find the largest batch
                        size that fits in memory.
  --mixed-precision {fp16,bf16,None}
                        Choose the precision mode for training to optimize performance and memory usage. Options are 'fp16', 'bf16', or None for
                        default precision. Default is None.
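For example, a local training run might look like this (a sketch only; the model, project name, and data path are placeholders, and --push-to-hub together with --username and --token is needed only if you want the trained model uploaded to the Hub):

autotrain text-regression \
    --train \
    --model google-bert/bert-base-uncased \
    --project-name my-text-regression \
    --data-path data/ \
    --train-split train \
    --text-column text \
    --target-column target \
    --epochs 3 \
    --batch-size 8 \
    --lr 5e-5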