
( hidden_size = 256 hidden_act = 'relu' mlp_dim = 2048 num_hidden_layers = 2 num_attention_heads = 8 attention_downsample_rate = 2 num_multimask_outputs = 3 iou_head_depth = 3 iou_head_hidden_dim = 256 layer_norm_eps = 1e-06 **kwargs )

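Parameters

hidden_size (int, optional, defaults to 256) — Dimensionality of the hidden states.
hidden_act (str, optional, defaults to "relu") — The non-linear activation function used inside the SamMaskDecoder module.
mlp_dim (int, optional, defaults to 2048) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 2) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
attention_downsample_rate (int, optional, defaults to 2) — The downsampling rate of the attention layer.
num_multimask_outputs (int, optional, defaults to 3) — The number of outputs from the SamMaskDecoder module. In the Segment Anything paper, this is set to 3.
iou_head_depth (int, optional, defaults to 3) — The number of layers in the IoU head module.
iou_head_hidden_dim (int, optional, defaults to 256) — The dimensionality of the hidden states in the IoU head module.
layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.

This is the configuration class to store the configuration of a SamMaskDecoder. It is used to instantiate a SAM mask decoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SAM-vit-h facebook/sam-vit-huge architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.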
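
To make num_multimask_outputs more concrete, here is a minimal plain-Python sketch. It assumes the decoder follows the original SAM layout, which reserves one extra mask token for the single-mask case and slices the candidate masks depending on multimask_output. The helper select_mask_slice is hypothetical and is not part of transformers.

```python
# Illustrative only: a plain-Python sketch, not the transformers implementation.
NUM_MULTIMASK_OUTPUTS = 3  # default value from the signature above

# Assumption: one extra token is reserved for the single-mask output.
NUM_MASK_TOKENS = NUM_MULTIMASK_OUTPUTS + 1


def select_mask_slice(multimask_output: bool) -> slice:
    """Return the slice of candidate masks kept in each mode (hypothetical helper)."""
    if multimask_output:
        return slice(1, None)  # keep the multi-mask candidates
    return slice(0, 1)  # keep only the dedicated single-mask candidate


candidates = [f"mask_{i}" for i in range(NUM_MASK_TOKENS)]
multi = candidates[select_mask_slice(True)]
single = candidates[select_mask_slice(False)]
```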

SamPromptEncoderConfig

class transformers.SamPromptEncoderConfig

( hidden_size = 256 image_size = 1024 patch_size = 16 mask_input_channels = 16 num_point_embeddings = 4 hidden_act = 'gelu' layer_norm_eps = 1e-06 **kwargs )

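Parameters

hidden_size (int, optional, defaults to 256) — Dimensionality of the hidden states.
image_size (int, optional, defaults to 1024) — The expected output resolution of the image.
patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
mask_input_channels (int, optional, defaults to 16) — The number of channels to be fed to the MaskDecoder module.
num_point_embeddings (int, optional, defaults to 4) — The number of point embeddings to be used.
hidden_act (str, optional, defaults to "gelu") — The non-linear activation function in the encoder and pooler.

This is the configuration class to store the configuration of a SamPromptEncoder. The SamPromptEncoder module is used to encode the input 2D points and bounding boxes. Instantiating a configuration with the defaults will yield a configuration similar to that of the SAM-vit-h facebook/sam-vit-huge architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.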
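
The relationship between image_size and patch_size can be sketched with a little arithmetic. The snippet assumes the prompt encoder works on a square grid of image_size // patch_size cells per side; embedding_grid is a hypothetical helper used only for illustration.

```python
def embedding_grid(image_size: int = 1024, patch_size: int = 16) -> tuple[int, int]:
    """Side lengths of the patch grid for a square image (hypothetical helper)."""
    if image_size % patch_size != 0:
        # Assumption: the image side is a whole multiple of the patch side.
        raise ValueError("image_size is assumed to be a multiple of patch_size")
    side = image_size // patch_size
    return side, side


# With the defaults from the signature above, each side holds 1024 // 16 cells.
grid = embedding_grid()
```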

SamProcessor

class transformers.SamProcessor

( image_processor )

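Parameters

image_processor (SamImageProcessor) — An instance of SamImageProcessor. The image processor is a required input.

Constructs a SAM processor which wraps a SAM image processor and a 2D points and bounding boxes processor into a single processor.

SamProcessor offers all the functionalities of SamImageProcessor. See the docstring of __call__() for more information.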

SamImageProcessor

class transformers.SamImageProcessor

( do_resize: bool = True size: typing.Optional[dict[str, int]] = None mask_size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_pad: bool = True pad_size: typing.Optional[int] = None mask_pad_size: typing.Optional[int] = None do_convert_rgb: bool = True **kwargs )

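Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (dict, optional, defaults to {"longest_edge": 1024}) — Size of the output image after resizing. Resizes the longest edge of the image to match size["longest_edge"] while maintaining the aspect ratio. Can be overridden by the size parameter in the preprocess method.
mask_size (dict, optional, defaults to {"longest_edge": 256}) — Size of the output segmentation map after resizing. Resizes the longest edge of the segmentation map to match mask_size["longest_edge"] while maintaining the aspect ratio. Can be overridden by the mask_size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Only has an effect if do_rescale is set to True. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or list[float], optional, defaults to IMAGENET_DEFAULT_MEAN) — Mean to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IMAGENET_DEFAULT_STD) — Standard deviation to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_pad (bool, optional, defaults to True) — Whether to pad the image to the specified pad_size. Can be overridden by the do_pad parameter in the preprocess method.
pad_size (dict, optional, defaults to {"height": 1024, "width": 1024}) — Size of the output image after padding. Can be overridden by the pad_size parameter in the preprocess method.
mask_pad_size (dict, optional, defaults to {"height": 256, "width": 256}) — Size of the output segmentation map after padding. Can be overridden by the mask_pad_size parameter in the preprocess method.
do_convert_rgb (bool, optional, defaults to True) — Whether to convert the image to RGB.

Constructs a SAM image processor.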
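
The rescale and normalize steps described above amount to simple per-channel arithmetic. The sketch below applies them to a single RGB pixel in plain Python. The mean and standard deviation are the standard ImageNet values, and rescale_and_normalize is a hypothetical helper, not a library function.

```python
# Standard ImageNet statistics (per RGB channel).
IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)


def rescale_and_normalize(
    pixel,
    rescale_factor=1 / 255,
    mean=IMAGENET_DEFAULT_MEAN,
    std=IMAGENET_DEFAULT_STD,
):
    """Rescale a 0-255 RGB pixel to 0-1, then normalize each channel (hypothetical helper)."""
    return tuple(
        (value * rescale_factor - m) / s for value, m, s in zip(pixel, mean, std)
    )


# A pure white pixel becomes (1 - mean) / std in every channel.
white = rescale_and_normalize((255, 255, 255))
```

If the input pixels already lie between 0 and 1, the rescale step should be skipped, which corresponds to do_rescale=False.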

filter_masks

( masks iou_scores original_size cropped_box_image pred_iou_thresh = 0.88 stability_score_thresh = 0.95 mask_threshold = 0 stability_score_offset = 1 return_tensors = 'pt' )

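Parameters

masks (Union[torch.Tensor, tf.Tensor]) — Input masks.
iou_scores (Union[torch.Tensor, tf.Tensor]) — List of IoU scores.
original_size (tuple[int,int]) — Size of the original image.
cropped_box_image (np.array) — The cropped image.
pred_iou_thresh (float, optional, defaults to 0.88) — The threshold for the IoU scores.
stability_score_thresh (float, optional, defaults to 0.95) — The threshold for the stability score.
mask_threshold (float, optional, defaults to 0) — The threshold for the predicted masks.
stability_score_offset (float, optional, defaults to 1) — The offset for the stability score used in the _compute_stability_score method.
return_tensors (str, optional, defaults to pt) — If pt, returns torch.Tensor. If tf, returns tf.Tensor.

Filters the predicted masks by keeping only those that meet several criteria. The first criterion is that the IoU score must be greater than pred_iou_thresh. The second criterion is that the stability score must be greater than stability_score_thresh. The method also converts the predicted masks to bounding boxes and pads the predicted masks if necessary.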
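
The two criteria can be sketched in plain Python. This is a simplified illustration on nested lists, not the library code: it assumes the stability score is the IoU between the mask binarized at mask_threshold + stability_score_offset and at mask_threshold - stability_score_offset, and both helper names are hypothetical.

```python
def stability_score(mask_logits, mask_threshold=0.0, offset=1.0):
    """IoU between the mask binarized at a high and at a low threshold."""
    flat = [value for row in mask_logits for value in row]
    high = sum(value > mask_threshold + offset for value in flat)  # intersection
    low = sum(value > mask_threshold - offset for value in flat)  # union
    return high / low if low else 0.0


def filter_by_scores(masks, iou_scores, pred_iou_thresh=0.88, stability_score_thresh=0.95):
    """Return the indices of masks that pass both thresholds (hypothetical helper)."""
    keep = []
    for index, (mask, iou) in enumerate(zip(masks, iou_scores)):
        if iou > pred_iou_thresh and stability_score(mask) > stability_score_thresh:
            keep.append(index)
    return keep


confident = [[5.0, 5.0], [-5.0, -5.0]]  # logits far away from the threshold
borderline = [[0.5, 5.0], [-5.0, -5.0]]  # one pixel flips between the two thresholds
kept = filter_by_scores([confident, borderline, confident], [0.95, 0.95, 0.50])
```

Here only the first mask survives: the second fails the stability test and the third fails the IoU test.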

generate_crop_boxes

( image target_size crop_n_layers: int = 0 overlap_ratio: float = 0.3413333333333333 points_per_crop: typing.Optional[int] = 32 crop_n_points_downscale_factor: typing.Optional[list[int]] = 1 device: typing.Optional[ForwardRef('torch.device')] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None return_tensors: str = 'pt' )

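Parameters

image (np.array) — Input original image.
target_size (int) — Target size of the resized image.
crop_n_layers (int, optional, defaults to 0) — If >0, mask prediction will be run again on crops of the image. Sets the number of layers to run, where each layer has 2**i_layer image crops.
overlap_ratio (float, optional, defaults to 512/1500) — Sets the degree to which crops overlap. In the first crop layer, crops will overlap by this fraction of the image length. Later layers with more crops scale down this overlap.
points_per_crop (int, optional, defaults to 32) — Number of points to sample from each crop.
crop_n_points_downscale_factor (list[int], optional, defaults to 1) — The number of points-per-side sampled in layer n is scaled down by crop_n_points_downscale_factor**n.
device (torch.device, optional, defaults to None) — Device to use for the computation. If None, cpu will be used.
input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. If not provided, it will be inferred.
return_tensors (str, optional, defaults to pt) — If pt, returns torch.Tensor. If tf, returns tf.Tensor.

Generates a list of crop boxes of different sizes. Layer i has (2**i)**2 boxes.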
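
The layer arithmetic can be sketched as follows. The snippet assumes that layer 0 is the single uncropped image and that the per-layer point count is points_per_crop divided by crop_n_points_downscale_factor**n, truncated to an integer. Both helpers are hypothetical.

```python
def crops_per_layer(crop_n_layers: int) -> list[int]:
    """Number of crop boxes in layers 0..crop_n_layers, using (2**i)**2 per layer."""
    return [(2**i) ** 2 for i in range(crop_n_layers + 1)]


def points_per_layer(points_per_crop=32, crop_n_layers=0, crop_n_points_downscale_factor=1):
    """Point count in each layer after downscaling by factor**n (hypothetical helper)."""
    return [
        int(points_per_crop / (crop_n_points_downscale_factor**n))
        for n in range(crop_n_layers + 1)
    ]


boxes = crops_per_layer(2)  # one full-image box, then 4, then 16 crops
points = points_per_layer(32, crop_n_layers=2, crop_n_points_downscale_factor=2)
```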

pad_image

( image: ndarray pad_size: dict data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs )

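Parameters

image (np.ndarray) — Image to pad.
pad_size (dict[str, int]) — Size of the output image after padding.
data_format (str or ChannelDimension, optional) — The data format of the image. Can be either "channels_first" or "channels_last". If None, the data_format of the image will be used.
input_data_format (str or ChannelDimension, optional) — The channel dimension format of the input image. If not provided, it will be inferred.

Pads an image with zeros to (pad_size["height"], pad_size["width"]). The padding is added on the right and at the bottom.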
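
As a minimal sketch of this behavior, the snippet below zero-pads a single-channel image stored as nested lists. It only illustrates the right and bottom padding rule and is not the library implementation; pad_bottom_right is a hypothetical helper.

```python
def pad_bottom_right(image, pad_size):
    """Zero-pad a 2D (height, width) image on the bottom and right (hypothetical helper)."""
    height, width = len(image), len(image[0])
    pad_h = pad_size["height"] - height
    pad_w = pad_size["width"] - width
    if pad_h < 0 or pad_w < 0:
        raise ValueError("pad_size must be at least as large as the image")
    # Extend every row to the right, then append all-zero rows at the bottom.
    padded = [list(row) + [0] * pad_w for row in image]
    padded += [[0] * pad_size["width"] for _ in range(pad_h)]
    return padded


padded = pad_bottom_right([[1, 2], [3, 4]], {"height": 3, "width": 4})
```

Because the original pixels stay in the top-left corner, coordinates of points and boxes remain valid after padding.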

post_process_for_mask_generation

( all_masks all_scores all_boxes crops_nms_thresh return_tensors = 'pt' )

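Parameters

all_masks (Union[list[torch.Tensor], list[tf.Tensor]]) — List of all predicted segmentation masks.
all_scores (Union[list[torch.Tensor], list[tf.Tensor]]) — List of all predicted IoU scores.
all_boxes (Union[list[torch.Tensor], list[tf.Tensor]]) — List of the bounding boxes of all predicted masks.
crops_nms_thresh (float) — Threshold for the NMS (non-maximum suppression) algorithm.
return_tensors (str, optional, defaults to pt) — If pt, returns torch.Tensor. If tf, returns tf.Tensor.

Post-processes the generated masks by running the non-maximum suppression algorithm on the predicted masks.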
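
For readers unfamiliar with non-maximum suppression, here is a minimal greedy version in plain Python. It is a generic textbook sketch operating on (x1, y1, x2, y2) boxes, not the routine the library calls; the iou_threshold argument plays the role of crops_nms_thresh.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: visit boxes by descending score, drop those overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
```

The second box overlaps the first with an IoU of 81/119 and is suppressed, while the distant third box is kept.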

post_process_masks

( masks original_sizes reshaped_input_sizes mask_threshold = 0.0 binarize = True pad_size = None return_tensors = 'pt' ) → (Union[torch.Tensor, tf.Tensor])

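Parameters

masks (Union[list[torch.Tensor], list[np.ndarray], list[tf.Tensor]]) — Batched masks from the mask_decoder in (batch_size, num_channels, height, width) format.
original_sizes (Union[torch.Tensor, tf.Tensor, list[tuple[int,int]]]) — The original size of each image before it was resized to the model's expected input shape, in (height, width) format.
reshaped_input_sizes (Union[torch.Tensor, tf.Tensor, list[tuple[int,int]]]) — The size of each image as it is fed to the model, in (height, width) format. Used to remove padding.
mask_threshold (float, optional, defaults to 0.0) — The threshold to use for binarizing the masks.
binarize (bool, optional, defaults to True) — Whether to binarize the masks.
pad_size (int, optional, defaults to self.pad_size) — The target size the images were padded to before being passed to the model. If None, the target size is assumed to be the processor's pad_size.
return_tensors (str, optional, defaults to "pt") — If "pt", returns PyTorch tensors. If "tf", returns TensorFlow tensors.

(Union[torch.Tensor, tf.Tensor])

Batched masks in (batch_size, num_channels, height, width) format, where (height, width) is given by original_sizes.

Removes padding and upscales the masks to the original image size.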
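
The sequence of steps can be sketched for a single-channel mask stored as nested lists. This is a deliberately simplified illustration: it uses nearest-neighbor resizing where a real implementation would typically interpolate, it assumes pad_size is a dict with "height" and "width" keys like the processor's pad_size, and both helpers are hypothetical.

```python
def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D list (a simplification of real interpolation)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def post_process(mask, original_size, reshaped_input_size, pad_size, mask_threshold=0.0):
    """Upsample, remove padding, restore the original size, then binarize."""
    # 1. Upsample the low-resolution mask to the padded model input size.
    upsampled = resize_nearest(mask, pad_size["height"], pad_size["width"])
    # 2. Crop away the bottom and right padding.
    height, width = reshaped_input_size
    cropped = [row[:width] for row in upsampled[:height]]
    # 3. Resize to the original image size.
    restored = resize_nearest(cropped, *original_size)
    # 4. Binarize with the mask threshold.
    return [[value > mask_threshold for value in row] for row in restored]


low_res = [[2.0, -1.0], [-1.0, -1.0]]
result = post_process(
    low_res,
    original_size=(2, 4),
    reshaped_input_size=(2, 4),
    pad_size={"height": 4, "width": 4},
)
```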

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] segmentation_maps: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], NoneType] = None do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None mask_size: typing.Optional[dict[str, int]] = None resample: typing.Optional[ForwardRef('PILImageResampling')] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Union[int, float, NoneType] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_pad: typing.Optional[bool] = None pad_size: typing.Optional[dict[str, int]] = None mask_pad_size: typing.Optional[dict[str, int]] = None do_convert_rgb: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

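Parameters

images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
segmentation_maps (ImageInput, optional) — Segmentation map to preprocess.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (dict[str, int], optional, defaults to self.size) — Controls the size of the image after resize. The longest edge of the image is resized to size["longest_edge"] while preserving the aspect ratio.
mask_size (dict[str, int], optional, defaults to self.mask_size) — Controls the size of the segmentation map after resize. The longest edge of the segmentation map is resized to mask_size["longest_edge"] while preserving the aspect ratio.
resample (PILImageResampling, optional, defaults to self.resample) — PILImageResampling filter to use when resizing the image, e.g. PILImageResampling.BILINEAR.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image pixel values by the rescale factor.
rescale_factor (int or float, optional, defaults to self.rescale_factor) — Rescale factor to apply to the image pixel values.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean to normalize the image by if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation to normalize the image by if do_normalize is set to True.
do_pad (bool, optional, defaults to self.do_pad) — Whether to pad the image.
pad_size (dict[str, int], optional, defaults to self.pad_size) — Controls the size of the padding applied to the image. The image is padded to pad_size["height"] and pad_size["width"] if do_pad is set to True.
mask_pad_size (dict[str, int], optional, defaults to self.mask_pad_size) — Controls the size of the padding applied to the segmentation map. The segmentation map is padded to mask_pad_size["height"] and mask_pad_size["width"] if do_pad is set to True.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocesses an image or a batch of images.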

resize

( image: ndarray size: dict resample: Resampling = <Resampling.BICUBIC: 3> data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None **kwargs ) → np.ndarray

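Parameters

image (np.ndarray) — Image to resize.
size (dict[str, int]) — Dictionary in the format {"longest_edge": int} specifying the size of the output image. The longest edge of the image is resized to the specified size, while the other edge is resized to maintain the aspect ratio.
resample — PILImageResampling filter to use when resizing the image, e.g. PILImageResampling.BILINEAR.
data_format (ChannelDimension or str, optional) — The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.

np.ndarray

The resized image.

Resizes an image so that its longest edge matches size["longest_edge"], preserving the aspect ratio.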
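
The output shape of a longest-edge resize follows from a single scale factor. The sketch below assumes each side is rounded to the nearest integer; get_preprocess_shape is named here for illustration only and should not be read as a public library function.

```python
def get_preprocess_shape(old_shape, longest_edge):
    """Compute (height, width) after scaling the longest side to longest_edge."""
    old_h, old_w = old_shape
    scale = longest_edge / max(old_h, old_w)
    # Assumption: sides are rounded to the nearest integer.
    new_h = int(old_h * scale + 0.5)
    new_w = int(old_w * scale + 0.5)
    return new_h, new_w


# A 480 x 640 (height x width) image resized with longest_edge=1024.
shape = get_preprocess_shape((480, 640), 1024)
```

The shorter side ends up below 1024, which is why the image is padded afterwards when do_pad is enabled.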

SamVisionModel

class transformers.SamVisionModel

( config: SamVisionConfig )

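Parameters

config (SamVisionConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The vision model from Sam without any head or projection on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.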

forward

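( pixel_values: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.sam.modeling_sam.SamVisionEncoderOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using SamImageProcessor. See SamImageProcessor.__call__ for details (SamProcessor uses SamImageProcessor for processing images).
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

transformers.models.sam.modeling_sam.SamVisionEncoderOutput or tuple(torch.FloatTensor)

A transformers.models.sam.modeling_sam.SamVisionEncoderOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SamConfig) and inputs.

image_embeds (torch.FloatTensor of shape (batch_size, output_dim), optional, returned when the model is initialized with with_projection=True) — The image embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional, defaults to None) — Sequence of hidden states at the output of the last layer of the model.
hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The SamVisionModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre-processing and post-processing steps while the latter silently ignores them.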

SamModel

class transformers.SamModel

( config )

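Parameters

config (SamConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Segment Anything Model (SAM) for generating segmentation masks, given an input image.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.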

forward

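( pixel_values: typing.Optional[torch.FloatTensor] = None input_points: typing.Optional[torch.FloatTensor] = None input_labels: typing.Optional[torch.LongTensor] = None input_boxes: typing.Optional[torch.FloatTensor] = None input_masks: typing.Optional[torch.LongTensor] = None image_embeddings: typing.Optional[torch.FloatTensor] = None multimask_output: bool = True attention_similarity: typing.Optional[torch.FloatTensor] = None target_embedding: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None **kwargs ) → transformers.models.sam.modeling_sam.SamImageSegmentationOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using SamImageProcessor. See SamImageProcessor.__call__ for details (SamProcessor uses SamImageProcessor for processing images).
input_points (torch.FloatTensor of shape (batch_size, point_batch_size, num_points, 2)) — Input 2D spatial points, used by the prompt encoder to encode the prompt. Providing points generally yields much better results. The points can be obtained by passing a list of lists of lists to the processor, which will create the corresponding torch tensor of dimension 4. The first dimension is the image batch size, the second dimension is the point batch size (i.e. how many segmentation masks the model should predict per input point), the third dimension is the number of points per segmentation mask (it is possible to pass multiple points for a single mask), and the last dimension holds the x (horizontal) and y (vertical) coordinates of the point. If a different number of points is passed for each image or for each mask, the processor will create "PAD" points that correspond to the (0, 0) coordinate, and the embedding computation will be skipped for these points using the labels.
input_labels (torch.LongTensor of shape (batch_size, point_batch_size, num_points)) — Input labels for the points, used by the prompt encoder to encode the prompt. According to the official implementation, there are 3 types of labels:
- 1: the point contains the object of interest
- 0: the point does not contain the object of interest
- -1: the point corresponds to the background
We added the label:
- -10: the point is a padding point and should therefore be ignored by the prompt encoder
The padding labels are added automatically by the processor.
input_boxes (torch.FloatTensor of shape (batch_size, num_boxes, 4)) — Input boxes for the points, used by the prompt encoder to encode the prompt. Providing boxes generally yields much better generated masks. The boxes can be obtained by passing a list of lists of lists to the processor, which will generate a torch tensor whose dimensions correspond respectively to the image batch size, the number of boxes per image, and the coordinates of the top-left and bottom-right points of the box, in the order (x1, y1, x2, y2):
- x1: the x coordinate of the top-left point of the input box
- y1: the y coordinate of the top-left point of the input box
- x2: the x coordinate of the bottom-right point of the input box
- y2: the y coordinate of the bottom-right point of the input box
input_masks (torch.FloatTensor of shape (batch_size, image_size, image_size)) — The SAM model also accepts segmentation masks as input. The mask is embedded by the prompt encoder to generate a corresponding embedding, which is later fed to the mask decoder. These masks need to be supplied manually by the user, and they must be of shape (batch_size, image_size, image_size).
image_embeddings (torch.FloatTensor of shape (batch_size, output_channels, window_size, window_size)) — Image embeddings, used by the mask decoder to generate masks and IoU scores. For more memory-efficient computation, users can first retrieve the image embeddings using the get_image_embeddings method and then feed them to the forward method instead of feeding pixel_values.
multimask_output (bool, optional) — In the original implementation and paper, the model always outputs 3 masks per image (or per point / per bounding box, if relevant). However, it is possible to output a single mask, corresponding to the "best" mask, by specifying multimask_output=False.
attention_similarity (torch.FloatTensor, optional) — Attention similarity tensor, provided to the mask decoder for target-guided attention when the model is used for personalization, as introduced in PerSAM.
target_embedding (torch.FloatTensor, optional) — Embedding of the target concept, provided to the mask decoder for target-semantic prompting when the model is used for personalization, as introduced in PerSAM.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.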
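
The padding of point labels described for input_labels can be sketched in plain Python. This shows only the idea of filling ragged label lists with the padding label -10 so they can be stacked into one tensor; pad_labels is a hypothetical helper, and in practice the processor performs this step.

```python
PAD_LABEL = -10  # label marking padding points that the prompt encoder ignores


def pad_labels(label_lists, pad_value=PAD_LABEL):
    """Pad each list of point labels to the length of the longest one (hypothetical helper)."""
    longest = max(len(labels) for labels in label_lists)
    return [list(labels) + [pad_value] * (longest - len(labels)) for labels in label_lists]


# Two prompts: the first uses three points, the second only one.
padded = pad_labels([[1, 0, 1], [1]])
```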

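transformers.models.sam.modeling_sam.SamImageSegmentationOutput or tuple(torch.FloatTensor)

A transformers.models.sam.modeling_sam.SamImageSegmentationOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SamConfig) and inputs.

iou_scores (torch.FloatTensor of shape (batch_size, num_masks)) — The IoU scores of the predicted masks.
pred_masks (torch.FloatTensor of shape (batch_size, num_masks, height, width)) — The predicted low-resolution masks. These need to be post-processed by the processor.
vision_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the vision model at the output of each layer plus the optional initial embedding outputs.
vision_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
mask_decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The SamModel forward method overrides the __call__ special method.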

Example

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoModel, AutoProcessor

>>> model = AutoModel.from_pretrained("facebook/sam-vit-base")
>>> processor = AutoProcessor.from_pretrained("facebook/sam-vit-base")

>>> img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/sam-car.png"
>>> raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
>>> input_points = [[[400, 650]]]  # 2D location of a window on the car
>>> inputs = processor(images=raw_image, input_points=input_points, return_tensors="pt")

>>> # Get segmentation mask
>>> outputs = model(**inputs)

>>> # Postprocess masks
>>> masks = processor.post_process_masks(
...     outputs.pred_masks, inputs["original_sizes"], inputs["reshaped_input_sizes"]
... )
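
When several candidate masks are returned for one prompt, a common follow-up is to keep the candidate with the highest predicted IoU score. The sketch below shows that selection on a plain list of scores; the scores are made-up illustrative values rather than real model output, and best_mask_index is a hypothetical helper.

```python
def best_mask_index(iou_scores):
    """Index of the candidate mask with the highest predicted IoU (hypothetical helper)."""
    return max(range(len(iou_scores)), key=lambda i: iou_scores[i])


# Illustrative scores for the three candidates of a single prompt.
scores = [0.71, 0.93, 0.88]
best = best_mask_index(scores)
```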

TFSamVisionModel

class transformers.TFSamVisionModel

( config: SamVisionConfig **kwargs )

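Parameters

config (SamConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The vision model from Sam without any head or projection on top. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a TensorFlow keras.Model subclass. Use it as a regular TensorFlow model and refer to the TensorFlow documentation for all matters related to general usage and behavior.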

call

( pixel_values: TFModelInputType | None = None output_attentions: bool | None = None output_hidden_states: bool | None = None return_dict: bool | None = None training: bool = False **kwargs ) → transformers.models.sam.modeling_tf_sam.TFSamVisionEncoderOutput or tuple(tf.Tensor)

Parameters

pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using SamProcessor. See SamProcessor.__call__() for details.
output_attentions (bool, optional) — Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.

transformers.models.sam.modeling_tf_sam.TFSamVisionEncoderOutput or tuple(tf.Tensor)

A transformers.models.sam.modeling_tf_sam.TFSamVisionEncoderOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SamVisionConfig) and inputs.

image_embeds (tf.Tensor of shape (batch_size, output_dim), optional, returned when the model is initialized with with_projection=True) — The image embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden states at the output of the last layer of the model.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The TFSamVisionModel forward method overrides the __call__ special method.

TFSamModel

class transformers.TFSamModel

( config **kwargs )

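Parameters

config (SamConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Segment Anything Model (SAM) for generating segmentation masks, given an input image and optional 2D locations and bounding boxes. This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a TensorFlow keras.Model subclass. Use it as a regular TensorFlow model and refer to the TensorFlow documentation for all matters related to general usage and behavior.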

call