MaskFormer

This model was released on 2021-07-13 and added to Hugging Face Transformers on 2022-03-02.
This is a recently introduced model, so the API hasn't been tested extensively. There may be some bugs or slight breaking changes to fix it in the future. If you see something strange, file a Github Issue.
Overview
The MaskFormer model was proposed in Per-Pixel Classification is Not All You Need for Semantic Segmentation by Bowen Cheng, Alexander G. Schwing and Alexander Kirillov. MaskFormer addresses semantic segmentation with a mask classification paradigm instead of performing classic pixel-level classification.

The abstract from the paper is the following:

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

The figure below illustrates the architecture of MaskFormer. Taken from the original paper.
Usage tips
- MaskFormer's Transformer decoder is identical to the decoder of DETR. During training, the authors of DETR found it helpful to use auxiliary losses in the decoder to help the model output the correct number of objects of each class. If you set the parameter use_auxiliary_loss of MaskFormerConfig to True, then prediction feedforward neural networks and Hungarian losses are added after each decoder layer (with the FFNs sharing parameters); see the sketch after this list.
- If you want to train the model in a distributed environment across multiple nodes, you should update the get_num_masks function inside the MaskFormerLoss class of modeling_maskformer.py. When training on multiple nodes, it should be set to the average number of target masks across all nodes, as can be seen in the original implementation.
- You can use MaskFormerImageProcessor to prepare images for the model and optional targets.
- To get the final segmentation, depending on the task, you can call post_process_semantic_segmentation() or post_process_panoptic_segmentation(). Both tasks can be solved using the MaskFormerForInstanceSegmentation output; panoptic segmentation accepts an optional label_ids_to_fuse argument to fuse instances of the target object(s) (e.g. sky) together.
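For instance, a minimal sketch of enabling the auxiliary loss when building a model from scratch (the model below is randomly initialized, not a pretrained checkpoint):

>>> from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

>>> config = MaskFormerConfig(use_auxiliary_loss=True)
>>> model = MaskFormerForInstanceSegmentation(config)  # auxiliary predictions and losses are added for each decoder layer during training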
Resources

- All notebooks that illustrate inference with MaskFormer as well as fine-tuning on custom data can be found here.
- Scripts for fine-tuning MaskFormer with Trainer or Accelerate can be found here.
MaskFormer specific outputs
class transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput
< source >( encoder_last_hidden_state: torch.FloatTensor | None = None pixel_decoder_last_hidden_state: torch.FloatTensor | None = None transformer_decoder_last_hidden_state: torch.FloatTensor | None = None encoder_hidden_states: tuple[torch.FloatTensor] | None = None pixel_decoder_hidden_states: tuple[torch.FloatTensor] | None = None transformer_decoder_hidden_states: tuple[torch.FloatTensor] | None = None hidden_states: tuple[torch.FloatTensor] | None = None attentions: tuple[torch.FloatTensor] | None = None )
Parameters

- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model.
- pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
- transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.
- pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.
- transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states (also called feature maps) of the transformer decoder at the output of each stage.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and decoder_hidden_states.
- attentions (tuple[torch.FloatTensor] | None, optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Class for outputs of MaskFormerModel. This class returns all the needed hidden states to compute the logits.
class transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput
< source >( loss: torch.FloatTensor | None = None class_queries_logits: torch.FloatTensor | None = None masks_queries_logits: torch.FloatTensor | None = None auxiliary_logits: torch.FloatTensor | None = None encoder_last_hidden_state: torch.FloatTensor | None = None pixel_decoder_last_hidden_state: torch.FloatTensor | None = None transformer_decoder_last_hidden_state: torch.FloatTensor | None = None encoder_hidden_states: tuple[torch.FloatTensor] | None = None pixel_decoder_hidden_states: tuple[torch.FloatTensor] | None = None transformer_decoder_hidden_states: tuple[torch.FloatTensor] | None = None hidden_states: tuple[torch.FloatTensor] | None = None attentions: tuple[torch.FloatTensor] | None = None )
Parameters

- loss (torch.Tensor, optional) — The computed loss, returned when labels are present.
- class_queries_logits (torch.FloatTensor | None, defaults to None) — A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.
- masks_queries_logits (torch.FloatTensor | None, defaults to None) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.
- auxiliary_logits (Dict[str, torch.FloatTensor], optional, returned when output_auxiliary_logits=True) — Dictionary containing the auxiliary predictions for each decoder layer, returned when the auxiliary losses are enabled.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model.
- pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
- transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.
- pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.
- transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states of the transformer decoder at the output of each stage.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and decoder_hidden_states.
- attentions (tuple[torch.FloatTensor] | None, optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

Class for outputs of MaskFormerForInstanceSegmentation.

This output can be directly passed to post_process_semantic_segmentation(), post_process_instance_segmentation() or post_process_panoptic_segmentation() depending on the task. Please see MaskFormerImageProcessor for details regarding usage.
MaskFormerConfig
class transformers.MaskFormerConfig
< source >( fpn_feature_size: int = 256 mask_feature_size: int = 256 no_object_weight: float = 0.1 use_auxiliary_loss: bool = False backbone_config: dict | transformers.configuration_utils.PreTrainedConfig | None = None decoder_config: dict | None = None init_std: float = 0.02 init_xavier_std: float = 1.0 dice_weight: float = 1.0 cross_entropy_weight: float = 1.0 mask_weight: float = 20.0 output_auxiliary_logits: bool | None = None **kwargs )
Parameters

- mask_feature_size (int, optional, defaults to 256) — The masks' feature size; this value will also be used to specify the Feature Pyramid Network features' size.
- no_object_weight (float, optional, defaults to 0.1) — Weight to apply to the null (no object) class.
- use_auxiliary_loss (bool, optional, defaults to False) — If True, MaskFormerForInstanceSegmentationOutput will contain the auxiliary losses computed using the logits from each decoder's stage.
- backbone_config (Union[dict, "PreTrainedConfig"], optional, defaults to SwinConfig()) — The configuration passed to the backbone; if unset, the configuration corresponding to swin-base-patch4-window12-384 will be used.
- decoder_config (Dict, optional) — The configuration passed to the transformer decoder model; if unset, the base config for detr-resnet-50 will be used.
- init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- init_xavier_std (float, optional, defaults to 1) — The scaling factor used for the Xavier initialization gain in the HM Attention map module.
- dice_weight (float, optional, defaults to 1.0) — The weight for the dice loss.
- cross_entropy_weight (float, optional, defaults to 1.0) — The weight for the cross entropy loss.
- mask_weight (float, optional, defaults to 20.0) — The weight for the mask loss.
- output_auxiliary_logits (bool, optional) — Should the model output its auxiliary_logits or not.

Raises

ValueError — Raised if the backbone model type selected is not in ["swin"] or the decoder model type selected is not in ["detr"].
This is the configuration class to store the configuration of a MaskFormerModel. It is used to instantiate a MaskFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the MaskFormer facebook/maskformer-swin-base-ade architecture trained on ADE20k-150.
The configuration object inherits from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Currently, MaskFormer only supports the Swin Transformer as backbone.
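For example, a hedged sketch of swapping in a smaller Swin backbone via backbone_config (the hyperparameters below are illustrative, not the settings of any released checkpoint):

>>> from transformers import MaskFormerConfig, SwinConfig

>>> backbone_config = SwinConfig(embed_dim=96, depths=[2, 2, 6, 2], num_heads=[3, 6, 12, 24])
>>> configuration = MaskFormerConfig(backbone_config=backbone_config)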
Example
>>> from transformers import MaskFormerConfig, MaskFormerModel
>>> # Initializing a MaskFormer facebook/maskformer-swin-base-ade configuration
>>> configuration = MaskFormerConfig()
>>> # Initializing a model (with random weights) from the facebook/maskformer-swin-base-ade style configuration
>>> model = MaskFormerModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config

MaskFormerImageProcessor
class transformers.MaskFormerImageProcessor
< source >( do_resize: bool = True size: dict[str, int] | None = None size_divisor: int = 32 resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: float | list[float] | None = None image_std: float | list[float] | None = None ignore_index: int | None = None do_reduce_labels: bool = False num_labels: int | None = None pad_size: dict[str, int] | None = None **kwargs )
Parameters

- do_resize (bool, optional, defaults to True) — Whether to resize the input to a certain size.
- size (int, optional, defaults to 800) — Resize the input to the given size. Only has an effect if do_resize is set to True. If size is a sequence like (width, height), output size will be matched to this. If size is an int, the smaller edge of the image will be matched to this number, i.e. if height > width, then the image will be rescaled to (size * height / width, size).
- size_divisor (int, optional, defaults to 32) — Some backbones need images divisible by a certain number. If not passed, it defaults to the value used in Swin Transformer.
- resample (int, optional, defaults to Resampling.BILINEAR) — An optional resampling filter. This can be one of PIL.Image.Resampling.NEAREST, PIL.Image.Resampling.BOX, PIL.Image.Resampling.BILINEAR, PIL.Image.Resampling.HAMMING, PIL.Image.Resampling.BICUBIC or PIL.Image.Resampling.LANCZOS. Only has an effect if do_resize is set to True.
- do_rescale (bool, optional, defaults to True) — Whether to rescale the input by a certain scale.
- rescale_factor (float, optional, defaults to 1/255) — Rescale the input by the given factor. Only has an effect if do_rescale is set to True.
- do_normalize (bool, optional, defaults to True) — Whether or not to normalize the input with mean and standard deviation.
- image_mean (int, optional, defaults to [0.485, 0.456, 0.406]) — The sequence of means for each channel, to be used when normalizing images. Defaults to the ImageNet mean.
- image_std (int, optional, defaults to [0.229, 0.224, 0.225]) — The sequence of standard deviations for each channel, to be used when normalizing images. Defaults to the ImageNet std.
- ignore_index (int, optional) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
- do_reduce_labels (bool, optional, defaults to False) — Whether or not to decrement all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by ignore_index.
- num_labels (int, optional) — The number of labels in the segmentation map.
- pad_size (Dict[str, int], optional) — The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.
Constructs a MaskFormer image processor. The image processor can be used to prepare image(s) and optional targets for the model.
This image processor inherits from BaseImageProcessor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
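As a quick illustration (a sketch using a random image in place of a real one), preparing a single image looks like this:

>>> from transformers import MaskFormerImageProcessor
>>> from PIL import Image
>>> import numpy as np

>>> image_processor = MaskFormerImageProcessor()
>>> image = Image.fromarray(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> # the resulting BatchFeature holds "pixel_values" (and a "pixel_mask" when padding is applied)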
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: dict[int, int] | None = None do_resize: bool | None = None size: dict[str, int] | None = None size_divisor: int | None = None resample: PIL.Image.Resampling | None = None do_rescale: bool | None = None rescale_factor: float | None = None do_normalize: bool | None = None image_mean: float | list[float] | None = None image_std: float | list[float] | None = None ignore_index: int | None = None do_reduce_labels: bool | None = None return_tensors: str | transformers.utils.generic.TensorType | None = None data_format: str | transformers.image_utils.ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: str | transformers.image_utils.ChannelDimension | None = None pad_size: dict[str, int] | None = None )
encode_inputs
< source >( pixel_values_list: list segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: list[dict[int, int]] | dict[int, int] | None = None ignore_index: int | None = None do_reduce_labels: bool = False return_tensors: str | transformers.utils.generic.TensorType | None = None input_data_format: str | transformers.image_utils.ChannelDimension | None = None pad_size: dict[str, int] | None = None ) → BatchFeature
Parameters

- pixel_values_list (list[ImageInput]) — List of images (pixel values) to be padded. Each image should be a tensor of shape (channels, height, width).
- segmentation_maps (ImageInput, optional) — The corresponding semantic segmentation maps with the pixel-wise annotations.
- (bool, optional, defaults to True) — Whether or not to pad images up to the largest image in a batch and create a pixel mask. If left to the default, will return a pixel mask that is:
  - 1 for pixels that are real (i.e. not masked),
  - 0 for pixels that are padding (i.e. masked).
- instance_id_to_semantic_id (list[dict[int, int]] or dict[int, int], optional) — A mapping between object instance ids and class ids. If passed, segmentation_maps is treated as an instance segmentation map where each pixel represents an instance id. Can be provided as a single dictionary with a global/dataset-level mapping, or as a list of dictionaries (one per image) to map instance ids in each image separately.
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of NumPy arrays. If set to 'pt', return PyTorch torch.Tensor objects.
- pad_size (Dict[str, int], optional) — The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch.

Returns

A BatchFeature with the following fields:

- pixel_values — Pixel values to be fed to a model.
- pixel_mask — Pixel mask to be fed to a model (when =True or if pixel_mask is in self.model_input_names).
- mask_labels — Optional list of mask labels of shape (labels, height, width) to be fed to a model (when annotations are provided).
- class_labels — Optional list of class labels of shape (labels) to be fed to a model (when annotations are provided). They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].

Pads images up to the largest image in a batch and creates a corresponding pixel_mask.
MaskFormer addresses semantic segmentation with a mask classification paradigm, thus input segmentation maps will be converted to lists of binary masks and their respective labels. Let’s see an example, assuming segmentation_maps = [[2,6,7,9]], the output will contain mask_labels = [[1,0,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]] (four binary masks) and class_labels = [2,6,7,9], the labels for each mask.
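A runnable sketch of this conversion (the toy map below contains the two classes 2 and 6, so two binary masks come back):

>>> import numpy as np
>>> from transformers import MaskFormerImageProcessor

>>> image_processor = MaskFormerImageProcessor()
>>> image = np.random.randint(0, 255, (3, 64, 64), dtype=np.uint8)  # (channels, height, width)
>>> segmentation_map = np.zeros((64, 64), dtype=np.int64)
>>> segmentation_map[:32, :] = 2  # top half: class 2
>>> segmentation_map[32:, :] = 6  # bottom half: class 6
>>> inputs = image_processor.encode_inputs([image], [segmentation_map], return_tensors="pt")
>>> inputs["mask_labels"][0].shape  # torch.Size([2, 64, 64]): one binary mask per class
>>> inputs["class_labels"][0]  # tensor([2, 6]): the label of each mask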
post_process_semantic_segmentation
< source >( outputs target_sizes: list[tuple[int, int]] | None = None ) → list[torch.Tensor]
Parameters

- outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
- target_sizes (list[tuple[int, int]], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

Returns

list[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.
Converts the output of MaskFormerForInstanceSegmentation into semantic segmentation maps. Only supports PyTorch.
post_process_instance_segmentation
< source >( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 target_sizes: list[tuple[int, int]] | None = None return_coco_annotation: bool | None = False return_binary_maps: bool | None = False ) → list[Dict]
Parameters

- outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
- threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
- mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
- overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
- target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.
- return_coco_annotation (bool, optional, defaults to False) — If set to True, segmentation maps are returned in COCO run-length encoding (RLE) format.
- return_binary_maps (bool, optional, defaults to False) — If set to True, segmentation maps are returned as a concatenated tensor of binary segmentation maps (one per detected instance).

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

- segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, or a list[List] run-length encoding (RLE) of the segmentation map if return_coco_annotation is set to True, or a tensor of shape (num_instances, height, width) if return_binary_maps is set to True. Set to None if no mask is found above threshold.
- segments_info — A dictionary that contains additional information on each segment.
  - id — An integer representing the segment_id.
  - label_id — An integer representing the label / semantic class id corresponding to segment_id.
  - score — Prediction score of the segment with segment_id.
Converts the output of MaskFormerForInstanceSegmentationOutput into instance segmentation predictions. Only supports PyTorch. If instances could overlap, set either return_coco_annotation or return_binary_maps to True to get the correct segmentation result.
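For instance, a hedged usage sketch, assuming outputs comes from a MaskFormerForInstanceSegmentation forward pass and image is the corresponding input (see the full examples further below):

>>> results = image_processor.post_process_instance_segmentation(
...     outputs,
...     threshold=0.5,
...     target_sizes=[(image.height, image.width)],
...     return_binary_maps=True,  # (num_instances, height, width) masks that may overlap
... )
>>> for segment in results[0]["segments_info"]:
...     print(segment["id"], segment["label_id"], round(segment["score"], 2))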
post_process_panoptic_segmentation
< source >( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 label_ids_to_fuse: set[int] | None = None target_sizes: list[tuple[int, int]] | None = None ) → list[Dict]
Parameters

- outputs (MaskFormerForInstanceSegmentationOutput) — The outputs from MaskFormerForInstanceSegmentation.
- threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
- mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
- overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
- label_ids_to_fuse (Set[int], optional) — The labels in this set will have all their instances fused together. For instance, we could say there can only be one sky in an image, but several persons, so the label id for sky would be in that set, but not the one for person.
- target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction in the batch. If left to None, predictions will not be resized.

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

- segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, set to None if no mask is found above threshold. If target_sizes is specified, segmentation is resized to the corresponding target_sizes entry.
- segments_info — A dictionary that contains additional information on each segment.
  - id — An integer representing the segment_id.
  - label_id — An integer representing the label / semantic class id corresponding to segment_id.
  - was_fused — A boolean, True if label_id was in label_ids_to_fuse, False otherwise. Multiple instances of the same class / label were fused and assigned a single segment_id.
  - score — Prediction score of the segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.
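A hedged sketch of fusing all instances of one class, assuming outputs comes from a panoptic checkpoint and image is the input; the label id below is purely illustrative, so look up the real id of a "stuff" class such as sky in model.config.id2label:

>>> results = image_processor.post_process_panoptic_segmentation(
...     outputs,
...     label_ids_to_fuse={5},  # hypothetical label id to fuse
...     target_sizes=[(image.height, image.width)],
... )
>>> panoptic_map = results[0]["segmentation"]
>>> fused_segments = [s for s in results[0]["segments_info"] if s["was_fused"]]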
MaskFormerImageProcessorFast
class transformers.MaskFormerImageProcessorFast
< source >( **kwargs: typing_extensions.Unpack[transformers.models.maskformer.image_processing_maskformer.MaskFormerImageProcessorKwargs] )
Constructs a fast MaskFormer image processor.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None instance_id_to_semantic_id: list[dict[int, int]] | dict[int, int] | None = None **kwargs: typing_extensions.Unpack[transformers.models.maskformer.image_processing_maskformer.MaskFormerImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>
Parameters

- images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list, list, list]) — The images to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
- segmentation_maps (ImageInput, optional) — The segmentation maps.
- instance_id_to_semantic_id (Union[list[dict[int, int]], dict[int, int]], optional) — A mapping from instance ids to semantic ids.
- do_convert_rgb (bool | None) — Whether to convert the image to RGB.
- do_resize (bool | None) — Whether to resize the image.
- size (int | list[int] | tuple[int, ...] | dict[str, int] | None) — Describes the maximum input dimensions to the model.
- crop_size (int | list[int] | tuple[int, ...] | dict[str, int] | None) — Size of the output image after applying center_crop.
- resample (PILImageResampling | int | None) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
- do_rescale (bool | None) — Whether to rescale the image.
- rescale_factor (float | None) — Rescale factor to rescale the image by if do_rescale is set to True.
- do_normalize (bool | None) — Whether to normalize the image.
- image_mean (float | list[float] | tuple[float, ...] | None) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
- image_std (float | list[float] | tuple[float, ...] | None) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
- do_pad (bool | None) — Whether to pad the image. Padding is done either to the largest size in the batch or to a fixed square size per image; the exact padding strategy depends on the model.
- pad_size (int | list[int] | tuple[int, ...] | dict[str, int] | None) — The size {"height": int, "width": int} to pad the images to. Must be larger than any image size provided for preprocessing. If pad_size is not provided, images will be padded to the largest height and width in the batch. Only applied when do_pad=True.
- do_center_crop (bool | None) — Whether to center crop the image.
- data_format (str | ChannelDimension | None) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
- input_data_format (str | ChannelDimension | None) — The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of: "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format; "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format; "none" or ChannelDimension.NONE: image in (height, width) format.
- device (str | torch.device | None) — The device to process the images on. If unset, the device is inferred from the input images.
- return_tensors (str | TensorType | None) — Returns stacked tensors if set to 'pt', otherwise returns a list of tensors.
- disable_grouping (bool | None) — Whether to disable grouping of images by size to process them individually and not in batches. If None, will be set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157
- image_seq_length (int | None) — The number of image tokens to be used for each image in the input. Added for backward compatibility but should be set as a processor attribute in the future.
- size_divisor (int) — The size by which to make sure both the height and width can be divided.
- ignore_index (int, optional) — Label to be assigned to background pixels in segmentation maps. If provided, segmentation map pixels denoted with 0 (background) will be replaced with ignore_index.
- do_reduce_labels (bool, optional, defaults to False) — Whether or not to decrement all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by ignore_index.
- num_labels (int, optional) — The number of labels in the segmentation map.

Returns

<class 'transformers.image_processing_base.BatchFeature'>

- data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ('pixel_values', etc.).
- tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers to PyTorch/NumPy tensors at initialization.
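A short usage sketch of the fast processor; with a device argument, preprocessing runs directly on that device (assuming a CUDA-capable setup, otherwise it falls back to CPU):

>>> import torch
>>> import numpy as np
>>> from PIL import Image
>>> from transformers import MaskFormerImageProcessorFast

>>> image_processor = MaskFormerImageProcessorFast.from_pretrained("facebook/maskformer-swin-base-ade")
>>> image = Image.fromarray(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> inputs = image_processor(images=image, device=device, return_tensors="pt")
>>> inputs["pixel_values"].device  # tensors already live on the requested device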
post_process_semantic_segmentation
< source >( outputs target_sizes: list[tuple[int, int]] | None = None ) → list[torch.Tensor]
Parameters

- outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
- target_sizes (list[tuple[int, int]], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.

Returns

list[torch.Tensor]

A list of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each torch.Tensor corresponds to a semantic class id.
Converts the output of MaskFormerForInstanceSegmentation into semantic segmentation maps. Only supports PyTorch.
post_process_instance_segmentation
< source >( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 target_sizes: list[tuple[int, int]] | None = None return_coco_annotation: bool | None = False return_binary_maps: bool | None = False ) → list[Dict]
Parameters

- outputs (MaskFormerForInstanceSegmentation) — Raw outputs of the model.
- threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
- mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
- overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
- target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction. If left to None, predictions will not be resized.
- return_coco_annotation (bool, optional, defaults to False) — If set to True, segmentation maps are returned in COCO run-length encoding (RLE) format.
- return_binary_maps (bool, optional, defaults to False) — If set to True, segmentation maps are returned as a concatenated tensor of binary segmentation maps (one per detected instance).

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

- segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, or a list[List] run-length encoding (RLE) of the segmentation map if return_coco_annotation is set to True, or a tensor of shape (num_instances, height, width) if return_binary_maps is set to True. Set to None if no mask is found above threshold.
- segments_info — A dictionary that contains additional information on each segment.
  - id — An integer representing the segment_id.
  - label_id — An integer representing the label / semantic class id corresponding to segment_id.
  - score — Prediction score of the segment with segment_id.
Converts the output of MaskFormerForInstanceSegmentationOutput into instance segmentation predictions. Only supports PyTorch. If instances could overlap, set either return_coco_annotation or return_binary_maps to True to get the correct segmentation result.
post_process_panoptic_segmentation
< source >( outputs threshold: float = 0.5 mask_threshold: float = 0.5 overlap_mask_area_threshold: float = 0.8 label_ids_to_fuse: set[int] | None = None target_sizes: list[tuple[int, int]] | None = None ) → list[Dict]
Parameters

- outputs (MaskFormerForInstanceSegmentationOutput) — The outputs from MaskFormerForInstanceSegmentation.
- threshold (float, optional, defaults to 0.5) — The probability score threshold to keep predicted instance masks.
- mask_threshold (float, optional, defaults to 0.5) — Threshold to use when turning the predicted masks into binary values.
- overlap_mask_area_threshold (float, optional, defaults to 0.8) — The overlap mask area threshold to merge or discard small disconnected parts within each binary instance mask.
- label_ids_to_fuse (Set[int], optional) — The labels in this set will have all their instances fused together. For instance, we could say there can only be one sky in an image, but several persons, so the label id for sky would be in that set, but not the one for person.
- target_sizes (list[Tuple], optional) — List of length (batch_size), where each list item (tuple[int, int]) corresponds to the requested final size (height, width) of each prediction in the batch. If left to None, predictions will not be resized.

Returns

list[Dict]

A list of dictionaries, one per image, each dictionary containing two keys:

- segmentation — A tensor of shape (height, width) where each pixel represents a segment_id, set to None if no mask is found above threshold. If target_sizes is specified, segmentation is resized to the corresponding target_sizes entry.
- segments_info — A dictionary that contains additional information on each segment.
  - id — An integer representing the segment_id.
  - label_id — An integer representing the label / semantic class id corresponding to segment_id.
  - was_fused — A boolean, True if label_id was in label_ids_to_fuse, False otherwise. Multiple instances of the same class / label were fused and assigned a single segment_id.
  - score — Prediction score of the segment with segment_id.

Converts the output of MaskFormerForInstanceSegmentationOutput into image panoptic segmentation predictions. Only supports PyTorch.
MaskFormerModel
class transformers.MaskFormerModel
< source >( config: MaskFormerConfig )
Parameters

- config (MaskFormerConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare MaskFormer Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( pixel_values: Tensor pixel_mask: torch.Tensor | None = None output_hidden_states: bool | None = None output_attentions: bool | None = None return_dict: bool | None = None **kwargs ) → transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)
Parameters

- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using MaskFormerImageProcessorFast. See MaskFormerImageProcessorFast.__call__() for details.
- pixel_mask (torch.Tensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:
  - 1 for pixels that are real (i.e. not masked),
  - 0 for pixels that are padding (i.e. masked).
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns

transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or tuple(torch.FloatTensor)

A transformers.models.maskformer.modeling_maskformer.MaskFormerModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (MaskFormerConfig) and inputs.

- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).
- pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
- transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.
- pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.
- transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states (also called feature maps) of the transformer decoder at the output of each stage.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and decoder_hidden_states.
- attentions (tuple[torch.FloatTensor] | None, optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerModel forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example
>>> from transformers import AutoImageProcessor, MaskFormerModel
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO
>>> # load MaskFormer fine-tuned on ADE20k semantic segmentation
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerModel.from_pretrained("facebook/maskformer-swin-base-ade")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> with httpx.stream("GET", url) as response:
... image = Image.open(BytesIO(response.read()))
>>> inputs = image_processor(image, return_tensors="pt")
>>> # forward pass
>>> outputs = model(**inputs)
>>> # the decoder of MaskFormer outputs hidden states of shape (batch_size, num_queries, hidden_size)
>>> transformer_decoder_last_hidden_state = outputs.transformer_decoder_last_hidden_state
>>> list(transformer_decoder_last_hidden_state.shape)
[1, 100, 256]

MaskFormerForInstanceSegmentation
forward
< source >( pixel_values: Tensor mask_labels: list[torch.Tensor] | None = None class_labels: list[torch.Tensor] | None = None pixel_mask: torch.Tensor | None = None output_auxiliary_logits: bool | None = None output_hidden_states: bool | None = None output_attentions: bool | None = None return_dict: bool | None = None **kwargs ) → transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)
Parameters

- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using MaskFormerImageProcessorFast. See MaskFormerImageProcessorFast.__call__() for details.
- mask_labels (list[torch.Tensor], optional) — List of mask labels of shape (num_labels, height, width) to be fed to a model.
- class_labels (list[torch.LongTensor], optional) — List of target class labels of shape (num_labels, height, width) to be fed to a model. They identify the labels of mask_labels, e.g. the label of mask_labels[i][j] is class_labels[i][j].
- pixel_mask (torch.Tensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:
  - 1 for pixels that are real (i.e. not masked),
  - 0 for pixels that are padding (i.e. masked).
- output_auxiliary_logits (bool, optional) — Whether or not to output auxiliary logits.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns

transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or tuple(torch.FloatTensor)

A transformers.models.maskformer.modeling_maskformer.MaskFormerForInstanceSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (MaskFormerConfig) and inputs.

- loss (torch.Tensor, optional) — The computed loss, returned when labels are present.
- class_queries_logits (torch.FloatTensor | None, defaults to None) — A tensor of shape (batch_size, num_queries, num_labels + 1) representing the proposed classes for each query. Note the + 1 is needed because we incorporate the null class.
- masks_queries_logits (torch.FloatTensor | None, defaults to None) — A tensor of shape (batch_size, num_queries, height, width) representing the proposed masks for each query.
- auxiliary_logits (Dict[str, torch.FloatTensor], optional, returned when output_auxiliary_logits=True is passed) — Dictionary containing the auxiliary predictions for each decoder layer, returned when the auxiliary losses are enabled.
- encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the encoder model (backbone).
- pixel_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
- transformer_decoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Last hidden states (final feature map) of the last stage of the transformer decoder model.
- encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the encoder model at the output of each stage.
- pixel_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden states (also called feature maps) of the pixel decoder model at the output of each stage.
- transformer_decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden states of the transformer decoder at the output of each stage.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor containing encoder_hidden_states, pixel_decoder_hidden_states and decoder_hidden_states.
- attentions (tuple[torch.FloatTensor] | None, optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The MaskFormerForInstanceSegmentation forward method, overrides the __call__ special method.

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Example

Semantic segmentation example:
>>> from transformers import AutoImageProcessor, MaskFormerForInstanceSegmentation
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO
>>> # load MaskFormer fine-tuned on ADE20k semantic segmentation
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")
>>> url = (
...     "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
... )
>>> with httpx.stream("GET", url) as response:
... image = Image.open(BytesIO(response.read()))
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # model predicts class_queries_logits of shape `(batch_size, num_queries)`
>>> # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits
>>> # you can pass them to image_processor for postprocessing
>>> predicted_semantic_map = image_processor.post_process_semantic_segmentation(
... outputs, target_sizes=[(image.height, image.width)]
... )[0]
>>> # we refer to the demo notebooks for visualization (see "Resources" section in the MaskFormer docs)
>>> list(predicted_semantic_map.shape)
[512, 683]

Panoptic segmentation example:
>>> from transformers import AutoImageProcessor, MaskFormerForInstanceSegmentation
>>> from PIL import Image
>>> import httpx
>>> from io import BytesIO
>>> # load MaskFormer fine-tuned on COCO panoptic segmentation
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/maskformer-swin-base-coco")
>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-coco")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> with httpx.stream("GET", url) as response:
... image = Image.open(BytesIO(response.read()))
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # model predicts class_queries_logits of shape `(batch_size, num_queries)`
>>> # and masks_queries_logits of shape `(batch_size, num_queries, height, width)`
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits
>>> # you can pass them to image_processor for postprocessing
>>> result = image_processor.post_process_panoptic_segmentation(outputs, target_sizes=[(image.height, image.width)])[0]
>>> # we refer to the demo notebooks for visualization (see "Resources" section in the MaskFormer docs)
>>> predicted_panoptic_map = result["segmentation"]
>>> list(predicted_panoptic_map.shape)
[480, 640]
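As a follow-up sketch (not part of the original doctest), each entry in segments_info can be mapped back to a human-readable label through the model configuration:

>>> for segment in result["segments_info"]:
...     label = model.config.id2label[segment["label_id"]]
...     print(f"segment {segment['id']}: {label} (score {segment['score']:.2f})")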