This model was released on 2021-07-13 and added to Hugging Face Transformers on 2022-03-02.

MaskFormer

PyTorch

This is a recently introduced model, so the API hasn't been tested extensively. There may be some bugs or slight breaking changes to fix in the future. If you see anything strange, please file a GitHub issue.

Overview

The MaskFormer model was proposed in Per-Pixel Classification is Not All You Need for Semantic Segmentation by Bowen Cheng, Alexander G. Schwing and Alexander Kirillov. MaskFormer addresses semantic segmentation with a mask classification paradigm instead of performing classic pixel-level classification.

The abstract from the paper is the following:

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic segmentation (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

The figure below illustrates the architecture of MaskFormer. Taken from the original paper.

This model was contributed by francesco. The original code can be found here.

Usage tips

  • MaskFormer's Transformer decoder is identical to the decoder of DETR. During training, the authors of DETR found it helpful to use auxiliary losses in the decoder to get the model to output the correct number of objects of each class. If you set the parameter use_auxiliary_loss of MaskFormerConfig to True, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters).
  • If you want to train the model in a distributed environment across multiple nodes, then one should update the get_num_masks function inside the MaskFormerLoss class of modeling_maskformer.py. When training on multiple nodes, this should be set to the average number of target masks across all nodes, as can be seen in the original implementation.
  • One can use MaskFormerImageProcessor to prepare images and optional targets for the model.
  • To get the final segmentation, depending on the task, you can call post_process_semantic_segmentation() or post_process_panoptic_segmentation(). Both tasks can be solved using MaskFormerForInstanceSegmentation output; panoptic segmentation accepts an optional label_ids_to_fuse argument to fuse instances of the target object(s) (e.g. sky) together.
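To build intuition for what semantic post-processing does with the two output heads, here is a minimal numpy sketch. The function name and shapes are illustrative, not the library implementation: class logits get a softmax (dropping the trailing null class), mask logits get a sigmoid, and the class-weighted sum of masks is argmaxed per pixel.

```python
import numpy as np

def semantic_map_from_queries(class_queries_logits, masks_queries_logits):
    """Conceptual sketch of semantic segmentation post-processing.

    class_queries_logits: (num_queries, num_labels + 1), last index = null class
    masks_queries_logits: (num_queries, height, width)
    Returns a (height, width) map of semantic class ids.
    """
    # Softmax over classes, then drop the null (no-object) class.
    shifted = class_queries_logits - class_queries_logits.max(-1, keepdims=True)
    exp = np.exp(shifted)
    class_probs = (exp / exp.sum(-1, keepdims=True))[:, :-1]  # (Q, num_labels)
    # Sigmoid turns mask logits into per-query mask probabilities.
    mask_probs = 1.0 / (1.0 + np.exp(-masks_queries_logits))  # (Q, H, W)
    # Weight every mask by its class probabilities and sum over queries.
    segmentation = np.einsum("qc,qhw->chw", class_probs, mask_probs)
    return segmentation.argmax(axis=0)  # (H, W), one class id per pixel

rng = np.random.default_rng(0)
sem = semantic_map_from_queries(rng.normal(size=(100, 151)),
                                rng.normal(size=(100, 32, 32)))
print(sem.shape)  # (32, 32)
```

With 150 real classes (ADE20k-style) plus one null class, the result holds ids in [0, 149].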

Resources

Image segmentation
  • All notebooks that illustrate inference as well as fine-tuning MaskFormer on custom data can be found here.
  • Scripts for fine-tuning MaskFormer with Trainer or Accelerate can be found here.

MaskFormer specific outputs

class transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput


( encoder_last_hidden_state: torch.FloatTensor | None = None pixel_decoder_last_hidden_state: torch.FloatTensor | None = None transformer_decoder_last_hidden_state: torch.FloatTensor | None = None encoder_hidden_states: tuple[torch.FloatTensor] | None = None pixel_decoder_hidden_states: tuple[torch.FloatTensor] | None = None transformer_decoder_hidden_states: tuple[torch.FloatTensor] | None = None hidden_states: tuple[torch.FloatTensor] | None = None attentions: tuple[torch.FloatTensor] | None = None )

Parameters

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model.
  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.
  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.
  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states (also called feature maps) of the transformer decoder at the output of each stage.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and decoder_hidden_states.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Class for outputs of MaskFormerModel. This class returns all the hidden states needed to compute the logits.

class transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput


( loss: torch.FloatTensor | None = None class_queries_logits: torch.FloatTensor | None = None masks_queries_logits: torch.FloatTensor | None = None auxiliary_logits: torch.FloatTensor | None = None encoder_last_hidden_state: torch.FloatTensor | None = None pixel_decoder_last_hidden_state: torch.FloatTensor | None = None transformer_decoder_last_hidden_state: torch.FloatTensor | None = None encoder_hidden_states: tuple[torch.FloatTensor] | None = None pixel_decoder_hidden_states: tuple[torch.FloatTensor] | None = None transformer_decoder_hidden_states: tuple[torch.FloatTensor] | None = None hidden_states: tuple[torch.FloatTensor] | None = None attentions: tuple[torch.FloatTensor] | None = None )

Parameters

  • loss (torch.Tensor, optional) — The computed loss, returned when labels are present.
  • class_queries_logits (torch.FloatTensor of shape (batch_size, num_queries, num_labels + 1)) — A tensor representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.
  • masks_queries_logits (torch.FloatTensor of shape (batch_size, num_queries, height, width)) — A tensor representing the proposed masks for each query.
  • auxiliary_logits (Dict[str, torch.FloatTensor], optional, returned when output_auxiliary_logits=True) — Dictionary containing the auxiliary predictions for each decoder layer, returned when the auxiliary losses are enabled.
  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model.
  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.
  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.
  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states of the transformer decoder at the output of each stage.
  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and decoder_hidden_states.
  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Class for outputs of MaskFormerForInstanceSegmentation.

This output can be directly passed to post_process_semantic_segmentation(), post_process_instance_segmentation() or post_process_panoptic_segmentation() depending on the task. Please see MaskFormerImageProcessor for details regarding usage.

MaskFormerConfig

class transformers.MaskFormerConfig


( fpn_feature_size: int = 256 mask_feature_size: int = 256 no_object_weight: float = 0.1 use_auxiliary_loss: bool = False backbone_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None decoder_config: dict | None = None init_std: float = 0.02 init_xavier_std: float = 1.0 dice_weight: float = 1.0 cross_entropy_weight: float = 1.0 mask_weight: float = 20.0 output_auxiliary_logits: bool | None = None **kwargs )

Parameters

  • mask_feature_size (int, optional, defaults to 256) — The masks' features size; this value will also be used to specify the Feature Pyramid Network features' size.
  • no_object_weight (float, optional, defaults to 0.1) — Weight to apply to the null (no object) class.
  • use_auxiliary_loss (bool, optional, defaults to False) — If True MaskFormerForInstanceSegmentationOutput will contain the auxiliary losses computed using the logits from each decoder's stage.
  • backbone_config (Union[dict, "PreTrainedConfig"], optional, defaults to SwinConfig()) — The configuration passed to the backbone, if unset, the configuration corresponding to swin-base-patch4-window12-384 will be used.
  • decoder_config (Dict, optional) — The configuration passed to the transformer decoder model, if unset the base config for detr-resnet-50 will be used.
  • init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
  • init_xavier_std (float, optional, defaults to 1) — The scaling factor used for the Xavier initialization gain in the HM Attention map module.
  • dice_weight (float, optional, defaults to 1.0) — The weight for the dice loss.
  • cross_entropy_weight (float, optional, defaults to 1.0) — The weight for the cross entropy loss.
  • mask_weight (float, optional, defaults to 20.0) — The weight for the mask loss.
  • output_auxiliary_logits (bool, optional) — Should the model output its auxiliary_logits or not.

Raises

ValueError

  • ValueError — Raised if the backbone model type selected is not in ["swin"] or the decoder model type selected is not in ["detr"]

This is the configuration class to store the configuration of a MaskFormerModel. It is used to instantiate a MaskFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the MaskFormer facebook/maskformer-swin-base-ade architecture trained on ADE20k-150.

Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.

Currently, MaskFormer only supports the Swin Transformer as backbone.

Example:

>>> from transformers import MaskFormerConfig, MaskFormerModel

>>> # Initializing a MaskFormer facebook/maskformer-swin-base-ade configuration
>>> configuration = MaskFormerConfig()

>>> # Initializing a model (with random weights) from the facebook/maskformer-swin-base-ade style configuration
>>> model = MaskFormerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

MaskFormerImageProcessor

class transformers.MaskFormerImageProcessor


( do_resize: bool = True size: dict[str, int] | None = None size_divisor: int = 32 resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: float | list[float] | None = None image_std: float | list[float] | None = None ignore_index: int | None = None do_reduce_labels: bool = False num_labels: int | None = None pad_size: dict[str, int] | None = None **kwargs )

Parameters

  • do_resize (bool, optional, defaults to True) — Whether to resize the input to a certain size.
  • size (int, optional, defaults to 800) — Resize the input to the given size. Only has an effect if do_resize is set to True. If size is a sequence like (width, height), output size will be matched to this. If size is an int, smaller edge of the image will be matched to this number. i.e, if height > width, then image will be rescaled to (size * height / width, size).
  • size_divisor (int, optional, defaults to 32) — Some backbones need images divisible by a certain number. If not passed, it defaults to the value used in Swin Transformer.
  • resample (int, optional, defaults to Resampling.BILINEAR) — An optional resampling filter. This can be one of PIL.Image.Resampling.NEAREST, PIL.Image.Resampling.BOX, PIL.Image.Resampling.BILINEAR, PIL.Image.Resampling.HAMMING, PIL.Image.Resampling.BICUBIC or PIL.Image.Resampling.LANCZOS. Only has an effect if do_resize is set to True.
  • do_rescale (bool, optional, defaults to True) — Whether to rescale the input to a certain scale.
  • rescale_factor (float, optional, defaults to 1/ 255) — Rescale the input by the given factor. Only has an effect if do_rescale is set to True.
  • do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with mean and standard deviation.
  • image_mean (int, optional, defaults to [0.485, 0.456, 0.406]) — The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.
  • image_std (int, optional, defaults to [0.229, 0.224, 0.225]) — The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the ImageNet std.
  • ignore_index (int, optional) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
  • do_reduce_labels (bool, optional, defaults to False) — Whether or not to decrement all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by ignore_index.
  • num_labels (int, optional) — The number of labels in the segmentation map.
  • pad_size (Dict[str, int], optional) — The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.

Constructs a MaskFormer image processor. The image processor can be used to prepare image(s) and optional targets for the model.

This image processor inherits from BaseImageProcessor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
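The interaction between do_reduce_labels and ignore_index can be sketched in plain numpy. This is an illustrative sketch of the convention described above, not the library implementation; the ignore_index value of 255 is an assumption chosen for the example:

```python
import numpy as np

def reduce_labels(segmentation_map, ignore_index=255):
    """Sketch of label reduction: background pixels (label 0) become
    `ignore_index`, and every remaining label is shifted down by one so
    the foreground class ids start at 0 (the ADE20k convention)."""
    seg = np.asarray(segmentation_map).copy()
    seg[seg == 0] = ignore_index          # background is excluded from the classes
    seg = seg - 1                         # shift remaining labels down by one
    seg[seg == ignore_index - 1] = ignore_index  # keep ignore_index itself stable
    return seg

print(reduce_labels(np.array([[0, 1], [3, 0]])))  # [[255 0] [2 255]]
```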

preprocess


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: dict[int, int] | None = None do_resize: bool | None = None size: dict[str, int] | None = None size_divisor: int | None = None resample: PIL.Image.Resampling | None = None do_rescale: bool | None = None rescale_factor: float | None = None do_normalize: bool | None = None image_mean: float | list[float] | None = None image_std: float | list[float] | None = None ignore_index: int | None = None do_reduce_labels: bool | None = None return_tensors: str | transformers.utils.generic.TensorType | None = None data_format: str | transformers.image_utils.ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: str | transformers.image_utils.ChannelDimension | None = None pad_size: dict[str, int] | None = None )

encode_inputs


( pixel_values_list: list segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: list[dict[int, int]] | dict[int, int] | None = None ignore_index: int | None = None do_reduce_labels: bool = False return_tensors: str | transformers.utils.generic.TensorType | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None pad_size: dict[str, int] | None = None ) BatchFeature

Parameters

  • pixel_values_list (list[ImageInput]) — List of images (pixel values) to be padded. Each image should be a tensor of shape (channels, height, width).
  • segmentation_maps (ImageInput, optional) — The corresponding semantic segmentation maps with the pixel-wise annotations.

    (bool, optional, defaults to True): Whether or not to pad images up to the largest image in a batch and create a pixel mask.

    If left to the default, will return a pixel mask that is:

    • 1 for pixels that are real (i.e. not masked),
    • 0 for pixels that are padding (i.e. masked).
  • instance_id_to_semantic_id (list[dict[int, int]] or dict[int, int], optional) — A mapping between object instance ids and class ids. If passed, segmentation_maps is treated as an instance segmentation map where each pixel represents an instance id. Can be provided as a single dictionary with a global/dataset-level mapping or as a list of dictionaries (one per image), to map instance ids in each image separately.
  • return_tensors (str or TensorType, optional) — If set, will return tensors instead of NumPy arrays. If set to 'pt', return PyTorch torch.Tensor objects.
  • pad_size (Dict[str, int], optional) — The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.

Returns

BatchFeature

A BatchFeature with the following fields:

  • pixel_values — Pixel values to be fed to a model.
  • pixel_mask — Pixel mask to be fed to a model (when pixel_mask is in self.model_input_names).
  • mask_labels — Optional list of mask labels of shape (labels, height, width) to be fed to a model (when annotations are provided).
  • class_labels — Optional list of class labels of shape (labels) to be fed to a model (when annotations are provided). They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] if class_labels[i][j].

Pads images up to the largest image in a batch and creates a corresponding pixel_mask.
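The padding and pixel-mask step can be sketched as follows. This is a conceptual numpy sketch under the convention just described (bottom/right zero padding, mask of 1 for real pixels and 0 for padding); the function name is illustrative:

```python
import numpy as np

def pad_batch(images):
    """Pad each (C, H, W) image to the largest H and W in the batch and
    build a matching pixel mask: 1 for real pixels, 0 for padding."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    pixel_values, pixel_mask = [], []
    for img in images:
        c, h, w = img.shape
        padded = np.zeros((c, max_h, max_w), dtype=img.dtype)
        padded[:, :h, :w] = img           # original image in the top-left corner
        mask = np.zeros((max_h, max_w), dtype=np.int64)
        mask[:h, :w] = 1                  # mark the real (unpadded) region
        pixel_values.append(padded)
        pixel_mask.append(mask)
    return np.stack(pixel_values), np.stack(pixel_mask)

imgs = [np.ones((3, 2, 4)), np.ones((3, 3, 2))]
values, mask = pad_batch(imgs)
print(values.shape, mask.shape)  # (2, 3, 3, 4) (2, 3, 4)
```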

MaskFormer addresses semantic segmentation with a mask classification paradigm, thus input segmentation maps will be converted to lists of binary masks and their respective labels. Let’s see an example, assuming segmentation_maps = [[2,6,7,9]], the output will contain mask_labels = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]] (four binary masks) and class_labels = [2,6,7,9], the labels for each mask.
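The conversion in the example above can be reproduced in a few lines of numpy. This is an illustrative sketch of the idea, not the library's code:

```python
import numpy as np

def to_mask_classification(segmentation_map):
    """Convert a semantic segmentation map into one binary mask per class
    plus the matching class labels, as in the mask classification paradigm."""
    seg = np.asarray(segmentation_map)
    class_labels = np.unique(seg)                       # e.g. [2, 6, 7, 9]
    mask_labels = np.stack([(seg == label).astype(np.int64)
                            for label in class_labels])  # one binary mask per class
    return mask_labels, class_labels

seg = np.array([[2, 6], [7, 9]])
mask_labels, class_labels = to_mask_classification(seg)
print(class_labels)       # [2 6 7 9]
print(mask_labels.shape)  # (4, 2, 2)
```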

post_process_semantic_segmentation


( outputs target_sizes: list[tuple[int, int]] | None = None ) list[torch.Tensor]

Parameters

  • outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
  • target_sizes (list[tuple[int, int]], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

Returns

list[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of MaskFormerForInstanceSegmentation into semantic segmentation maps. Only supports PyTorch.

post_process_instance_segmentation


( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 target_sizes: list[tuple[int, int]] | None = None return_coco_annotation: bool | None = False return_binary_maps: bool | None = False ) list[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
  • threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
  • mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
  • overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
  • target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.
  • return_coco_annotation (bool, optional, defaults to False) — If set to True, segmentation maps are returned in COCO run-length encoding (RLE) format.
  • return_binary_maps (bool, optional, defaults to False) — If set to True, segmentation maps are returned as a concatenated tensor of binary segmentation maps (one per detected instance).

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, or list[List] run-length encoding (RLE) of the segmentation map if return_coco_annotation is set to True, or a tensor of shape (num_instances, height, width) if return_binary_maps is set to True. Set to None if no mask is found above threshold.
  • segments_info — A dictionary that contains additional information on each segment.
    • id — An integer representing the segment_id.
    • label_id — An integer representing the label / semantic class id corresponding to segment_id.
    • score — Prediction score of the segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into instance segmentation predictions. Only supports PyTorch. If instances could overlap, set either return_coco_annotation or return_binary_maps to True to get the correct segmentation result.
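To make the COCO run-length encoding option concrete, here is a sketch of uncompressed RLE for a single binary mask. It follows the COCO convention (column-major flattening, alternating run lengths starting with zeros); the function name is illustrative and this is not the library implementation:

```python
import numpy as np

def binary_mask_to_rle(mask):
    """Uncompressed COCO-style RLE: flatten the binary mask in column-major
    (Fortran) order and record alternating run lengths, starting with the
    number of leading zeros."""
    pixels = np.asarray(mask, dtype=np.uint8).flatten(order="F")
    runs = []
    prev, count = 0, 0
    for p in pixels:
        if p != prev:          # the run ended; store its length
            runs.append(count)
            prev, count = p, 0
        count += 1
    runs.append(count)         # store the final run
    return runs

mask = np.array([[0, 1],
                 [0, 1]])
print(binary_mask_to_rle(mask))  # [2, 2] — two zeros, then two ones
```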

post_process_panoptic_segmentation


( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 label_ids_to_fuse: set[int] | None = None target_sizes: list[tuple[int, int]] | None = None ) list[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentationOutput) — The outputs from MaskFormerForInstanceSegmentation.
  • threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
  • mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
  • overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
  • label_ids_to_fuse (Set[int], optional) — The labels in this set will have all their instances be fused together. For instance we could say there can only be one sky in an image, but several persons, so the label ID for sky would be in that set, but not the one for person.
  • target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction in the batch. If left to None, predictions will not be resized.

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, set to None if no mask is found above threshold. If target_sizes is specified, segmentation is resized to the corresponding target_sizes entry.
  • segments_info — A dictionary that contains additional information on each segment.
    • id — An integer representing the segment_id.
    • label_id — An integer representing the label / semantic class id corresponding to segment_id.
    • was_fused — A boolean, True if label_id was in label_ids_to_fuse, False otherwise. Multiple instances of the same class / label were fused and assigned a single segment_id.
    • score — Prediction score of the segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.

MaskFormerImageProcessorFast

class transformers.MaskFormerImageProcessorFast


( **kwargs: typing_extensions.Unpack[transformers.models.maskformer.image_processing_maskformer.MaskFormerImageProcessorKwargs] )

Constructs a fast MaskFormer image processor.

preprocess


( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: list[dict[int, int]] | dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.maskformer.image_processing_maskformer.MaskFormerImageProcessorKwargs] ) <class 'transformers.image_processing_base.BatchFeature'>

Parameters

  • images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list, list, list]) — The image(s) to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
  • segmentation_maps (ImageInput, optional) — The segmentation maps.
  • instance_id_to_semantic_id (Union[list[dict[int, int]], dict[int, int]], optional) — A mapping from instance ids to semantic ids.
  • do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
  • do_resize (bool, optional) — Whether to resize the image.
  • size (int | list[int] | tuple[int, ...] | dict[str, int], optional) — Describes the maximum input dimensions to the model.
  • crop_size (int | list[int] | tuple[int, ...] | dict[str, int], optional) — Size of the output image after applying center_crop.
  • resample (PILImageResampling | int, optional) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
  • do_rescale (bool, optional) — Whether to rescale the image.
  • rescale_factor (float, optional) — Rescale factor to rescale the image by if do_rescale is set to True.
  • do_normalize (bool, optional) — Whether to normalize the image.
  • image_mean (float | list[float] | tuple[float, ...], optional) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
  • image_std (float | list[float] | tuple[float, ...], optional) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
  • do_pad (bool, optional) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image; the exact padding strategy depends on the model.
  • pad_size (int | list[int] | tuple[int, ...] | dict[str, int], optional) — The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Only applied when do_pad=True.
  • do_center_crop (bool, optional) — Whether to center-crop the image.
  • data_format (str | ChannelDimension, optional) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
  • input_data_format (str | ChannelDimension, optional) — The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
    • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
    • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
    • "none" or ChannelDimension.NONE: image in (height, width) format.
  • device (str | torch.device, optional) — The device on which to process the images. If unset, the device is inferred from the input images.
  • return_tensors (str | TensorType, optional) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
  • disable_grouping (bool, optional) — Whether to disable grouping of images by size to process them individually rather than in batches. If None, it is set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
  • image_seq_length (int, optional) — The number of image tokens to be used for each image in the input. Added for backward compatibility but should be set as a processor attribute in the future.
  • size_divisor (int, optional) — The size by which to make sure both the height and width can be divided.
  • ignore_index (int, optional) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
  • do_reduce_labels (bool, optional, defaults to False) — Whether or not to decrement all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by ignore_index.
  • num_labels (int, optional) — The number of labels in the segmentation map.

Returns

<class 'transformers.image_processing_base.BatchFeature'>

  • data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ("pixel_values", etc.).
  • tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/Numpy tensors at initialization.

post_process_semantic_segmentation


( outputs target_sizes: list[tuple[int, int]] | None = None ) list[torch.Tensor]

Parameters

  • outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
  • target_sizes (list[tuple[int, int]], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

Returns

list[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.

Converts the output of MaskFormerForInstanceSegmentation into semantic segmentation maps. Only supports PyTorch.

post_process_instance_segmentation


( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 target_sizes: list[tuple[int, int]] | None = None return_coco_annotation: bool | None = False return_binary_maps: bool | None = False ) list[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
  • threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
  • mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
  • overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
  • target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.
  • return_coco_annotation (bool, optional, defaults to False) — If set to True, segmentation maps are returned in COCO run-length encoding (RLE) format.
  • return_binary_maps (bool, optional, defaults to False) — If set to True, segmentation maps are returned as a concatenated tensor of binary segmentation maps (one per detected instance).

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, or list[List] run-length encoding (RLE) of the segmentation map if return_coco_annotation is set to True, or a tensor of shape (num_instances, height, width) if return_binary_maps is set to True. Set to None if no mask is found above threshold.
  • segments_info — A dictionary that contains additional information on each segment.
    • id — An integer representing the segment_id.
    • label_id — An integer representing the label / semantic class id corresponding to segment_id.
    • score — Prediction score of the segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into instance segmentation predictions. Only supports PyTorch. If instances could overlap, set either return_coco_annotation or return_binary_maps to True to get the correct segmentation result.

post_process_panoptic_segmentation


( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 label_ids_to_fuse: set[int] | None = None target_sizes: list[tuple[int, int]] | None = None ) list[Dict]

Parameters

  • outputs (MaskFormerForInstanceSegmentationOutput) — The outputs from MaskFormerForInstanceSegmentation.
  • threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
  • mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
  • overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
  • label_ids_to_fuse (Set[int], optional) — The labels in this set will have all their instances be fused together. For instance we could say there can only be one sky in an image, but several persons, so the label ID for sky would be in that set, but not the one for person.
  • target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction in the batch. If left to None, predictions will not be resized.

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

  • segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, set to None if no mask is found above threshold. If target_sizes is specified, segmentation is resized to the corresponding target_sizes entry.
  • segments_info — A dictionary that contains additional information on each segment.
    • id — An integer representing the segment_id.
    • label_id — An integer representing the label / semantic class id corresponding to segment_id.
    • was_fused — A boolean, True if label_id was in label_ids_to_fuse, False otherwise. Multiple instances of the same class / label were fused and assigned a single segment_id.
    • score — Prediction score of the segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.

MaskFormerModel

class transformers.MaskFormerModel


( config: MaskFormerConfig )

Parameters

  • config (MaskFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare MaskFormer model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward


( pixel_values: Tensor pixel_mask: torch.Tensor | None = None output_hidden_states: bool | None = None output_attentions: bool | None = None return_dict: bool | None = None **kwargs ) transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)

参数

  • pixel_values (torch.Tensor,形状为 (batch_size, num_channels, image_size, image_size)) — 对应于输入图像的张量。像素值可以使用 MaskFormerImageProcessorFast 获取。有关详细信息,请参阅 MaskFormerImageProcessorFast.__call__()processor_class 使用 MaskFormerImageProcessorFast 进行图像处理)。
  • pixel_mask (torch.Tensor,形状为 (batch_size, height, width)可选) — 用于避免在填充像素值上执行注意力的掩码。掩码值选择在 [0, 1] 之间:

    • 1 表示真实(即未被掩码)的像素,
    • 0 表示填充(即被掩码)的像素。

    什么是注意力掩码?

  • output_hidden_states (bool, 可选) — 是否返回所有层的隐藏状态。有关详细信息,请参阅返回张量下的 hidden_states
  • output_attentions (bool, 可选) — 是否返回所有注意力层的注意力张量。有关详细信息,请参阅返回张量下的 attentions
  • return_dict (bool, 可选) — 是否返回 ModelOutput 而不是普通的元组。

返回

transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)

一个 transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput 对象或一个包含多个元素的 torch.FloatTensor 元组(如果传递了 return_dict=Falseconfig.return_dict=False),具体取决于配置 (MaskFormerConfig) 和输入。

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states (also called feature maps) of the transformer decoder at the output of each stage.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example

>>> from transformers import AutoImageProcessor, MaskFormerModel
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO

>>> # load MaskFormer fine-tuned on ADE20k semantic segmentation
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerModel.from_pretrained("facebook/maskformer-swin-base-ade")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read()))

>>> inputs = image_processor(image, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)

>>> # the decoder of MaskFormer outputs hidden states of shape (batch_size, num_queries, hidden_size)
>>> transformer_decoder_last_hidden_state = outputs.transformer_decoder_last_hidden_state
>>> list(transformer_decoder_last_hidden_state.shape)
[1, 100, 256]

MaskFormerForInstanceSegmentation

class transformers.MaskFormerForInstanceSegmentation

< >

( config: MaskFormerConfig )

forward

< >

( pixel_values: Tensor mask_labels: list[torch.Tensor] | None = None class_labels: list[torch.Tensor] | None = None pixel_mask: torch.Tensor | None = None output_auxiliary_logits: bool | None = None output_hidden_states: bool | None = None output_attentions: bool | None = None return_dict: bool | None = None **kwargs ) transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)

Parameters

  • pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using MaskFormerImageProcessorFast. See MaskFormerImageProcessorFast.__call__() for details (the processor_class uses MaskFormerImageProcessorFast for image processing).
  • mask_labels (list[torch.Tensor], optional) — List of mask labels of shape (num_labels, height, width) to be fed to a model.
  • class_labels (list[torch.LongTensor], optional) — List of target class labels of shape (num_labels,) to be fed to a model. They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].
  • pixel_mask (torch.Tensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:

    • 1 for pixels that are real (i.e. not masked),
    • 0 for pixels that are padding (i.e. masked).

    What are attention masks?

  • output_auxiliary_logits (bool, optional) — Whether or not to output auxiliary logits.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
  • return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)

A transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (MaskFormerConfig) and inputs.

  • loss (torch.Tensor, optional) — The computed loss, returned when labels are present.

  • class_queries_logits (torch.FloatTensor of shape (batch_size, num_queries, num_labels + 1)) — A tensor representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.

  • masks_queries_logits (torch.FloatTensor of shape (batch_size, num_queries, height, width)) — A tensor representing the proposed masks for each query.

  • auxiliary_logits (Dict[str, torch.FloatTensor], optional, returned when output_auxiliary_logits=True is passed) — Dictionary containing the auxiliary predictions for each decoder layer when auxiliary losses are enabled.

  • encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).

  • pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).

  • transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.

  • encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.

  • pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.

  • transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states of the transformer decoder at the output of each stage.

  • hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and transformer_decoder_hidden_states.

  • attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

    Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerForInstanceSegmentation forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example

Semantic segmentation example

>>> from transformers import AutoImageProcessor, MaskFormerForInstanceSegmentation
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO

>>> # load MaskFormer fine-tuned on ADE20k semantic segmentation
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")

>>> url = (
...     "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
... )
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read()))
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> # model predicts class_queries_logits of shape `(batch_size, num_queries)`
>>> # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

>>> # you can pass them to image_processor for postprocessing
>>> predicted_semantic_map = image_processor.post_process_semantic_segmentation(
...     outputs, target_sizes=[(image.height, image.width)]
... )[0]

>>> # we refer to the demo notebooks for visualization (see "Resources" section in the MaskFormer docs)
>>> list(predicted_semantic_map.shape)
[512, 683]

Panoptic segmentation example

>>> from transformers import AutoImageProcessor, MaskFormerForInstanceSegmentation
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO

>>> # load MaskFormer fine-tuned on COCO panoptic segmentation
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/maskformer-swin-base-coco")
>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-coco")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> with httpx.stream("GET", url) as response:
...     image = Image.open(BytesIO(response.read()))
>>> inputs = image_processor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> # model predicts class_queries_logits of shape `(batch_size, num_queries)`
>>> # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

>>> # you can pass them to image_processor for postprocessing
>>> result = image_processor.post_process_panoptic_segmentation(outputs, target_sizes=[(image.height, image.width)])[0]

>>> # we refer to the demo notebooks for visualization (see "Resources" section in the MaskFormer docs)
>>> predicted_panoptic_map = result["segmentation"]
>>> list(predicted_panoptic_map.shape)
[480, 640]
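For training, the mask_labels and class_labels arguments described above must be paired by index. The sketch below constructs hypothetical targets for a single image containing two objects; the class ids 17 and 42 and the half-image masks are made up for illustration, and the actual model call is shown only as a comment.

```python
import torch

# Hypothetical training targets for a batch of one image with two objects.
height, width = 64, 64
mask_labels = [torch.zeros(2, height, width)]  # one (num_labels, height, width) tensor per image
mask_labels[0][0, :32, :] = 1.0                # object 0 occupies the top half
mask_labels[0][1, 32:, :] = 1.0                # object 1 occupies the bottom half
class_labels = [torch.tensor([17, 42])]        # one (num_labels,) tensor per image; made-up class ids

# Each mask's label is paired by index: mask_labels[0][j] has class class_labels[0][j]
assert mask_labels[0].shape[0] == class_labels[0].shape[0]

# These lists would be passed alongside pixel_values, e.g.:
# outputs = model(pixel_values=inputs.pixel_values, mask_labels=mask_labels, class_labels=class_labels)
# outputs.loss.backward()
```

Note that num_labels can differ between images in a batch, which is why these targets are Python lists of tensors rather than a single stacked tensor.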