Building a Vision Mixture-of-Experts Model from Several Fine-Tuned Phi-3-Vision Models

Community article, published June 12, 2024

In this post, we describe a step-by-step process for building a mixture-of-experts (MoE) model from several fine-tuned Phi-3-Vision models. The approach is part of our Cephalo model series, a family of multimodal vision large language models (V-LLMs) focused on materials science, designed to integrate visual and linguistic data for advanced understanding and interaction in human-AI or multi-agent AI frameworks. The models are based on the Phi-3-Vision and Idefics-2 foundation models.

You can find a trained version of the Cephalo-MoE model at lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta. The model architecture is shown below.

[Figure: Cephalo-Phi-3-MoE model architecture]

While this post focuses on the Phi-3-Vision MoE model, a nearly identical approach can be used to create an Idefics-2-based V-MoE model. See: https://huggingface.co/lamm-mit/Cephalo-Idefics2-vision-3x8b-beta

Downloading the MoE model and example inference code

pip install transformers -U
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig

def count_parameters(model):
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    # return parameter counts in billions
    return total_params/1e9, trainable_params/1e9

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name_moe = "lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta"

processor = AutoProcessor.from_pretrained(model_name_moe, trust_remote_code=True) 
moe_model = AutoModelForCausalLM.from_pretrained(
    model_name_moe,
    trust_remote_code=True,  torch_dtype=torch.bfloat16,    
).to(device)
count_parameters(moe_model)

Building a Phi-3-V-MoE model from scratch using several pre-trained models

In this example, we use three models based on Phi-3-Vision: first, a model fine-tuned on materials science data; second, a model fine-tuned to convert images into LaTeX formulas; and third, the base Phi-3-Vision model. By integrating these three models into a MoE model, the resulting model has the collective, integrated capabilities of its building blocks, with a context length of 128,000 tokens.

In this process, we start from the base model and use a set of fine-tuned (or otherwise trained) models to create several expert models. Typically, each expert model specializes in a different aspect of the input data, allowing for greater flexibility and efficiency in processing. To achieve this, the original layers of the base model are replaced with modified layers that contain gating and expert mechanisms. A custom configuration class extends the base configuration with parameters specific to the MoE setup, such as the number of experts and the number of experts selected in each forward call, k. In the algorithm, the original MLP layers of the base model are replaced with new MoE layers that combine the outputs of the selected experts. Each MoE layer uses gating scores to select the outputs of the relevant experts and combines them into a single output via a weighted sum. The modified layers are then reintegrated into the model, producing a hybrid architecture that preserves the original model structure while gaining the enhanced capabilities.
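
To make this concrete, here is a minimal, self-contained sketch of a top-k gated MoE layer in PyTorch. It is not the actual Phi3VForCausalLMMoE implementation: the class name SimpleMoELayer, the plain nn.Linear experts, and the softmax gating are assumptions chosen purely to illustrate the gating-plus-weighted-sum idea described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Illustrative top-k gated MoE layer: a gate scores the experts,
    and the outputs of the k selected experts are combined by a weighted sum."""
    def __init__(self, hidden_size, expert_mlps, k=1):
        super().__init__()
        self.experts = nn.ModuleList(expert_mlps)              # e.g. MLP blocks taken from fine-tuned models
        self.gate = nn.Linear(hidden_size, len(expert_mlps))   # gating network
        self.k = k

    def forward(self, hidden_states):
        scores = F.softmax(self.gate(hidden_states), dim=-1)          # (batch, seq, num_experts)
        topk_idx = torch.topk(scores, self.k, dim=-1).indices         # indices of the k best experts per token
        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i).any(dim=-1)                        # tokens routed to expert i
            if mask.any():
                weight = scores[..., i].unsqueeze(-1)                 # gating weight for expert i
                output[mask] += weight[mask] * expert(hidden_states[mask])
        return output

# Example: three linear "experts" over a hidden size of 16, routing each token to the single best expert (k=1)
layer = SimpleMoELayer(hidden_size=16, expert_mlps=[nn.Linear(16, 16) for _ in range(3)], k=1)
out = layer(torch.randn(2, 5, 16))   # (batch=2, seq=5, hidden=16)

In the actual Cephalo MoE model, layers of this kind replace the MLP blocks inside each transformer layer, with the experts initialized from the corresponding weights of the fine-tuned models.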

The source code is also available at https://github.com/lamm-mit/Cephalo-Phi-3-MoE, where you can find additional details.

The detailed steps are as follows. First, download the .py files that implement the Phi-3-V and mixture-of-experts vision model:

pip install huggingface_hub
from huggingface_hub import HfApi, hf_hub_download
from tqdm.notebook import tqdm
import os
import shutil

# Repository details
repo_id = "lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta"
api = HfApi()

# List all files in the repository
files_in_repo = api.list_repo_files(repo_id)

# Filter for .py files
py_files = [file for file in files_in_repo if file.endswith('.py')]

# Directory to save the downloaded files
save_dir = "./Phi_3V_MoE/"
os.makedirs(save_dir, exist_ok=True)

# Download each .py file
for file_name in tqdm(py_files):
    file_path = hf_hub_download(repo_id=repo_id, filename=file_name)
    new_path = os.path.join(save_dir, file_name)
    shutil.move(file_path, new_path)
    print(f"Downloaded: {file_name}")

print("Download completed.")

Download the models that will form the experts, as well as the base model:

import copy

from Phi_3V_MoE.moe_phi3_v import Phi3VForCausalLMMoE, Phi3VForCausalLMMoEConfig

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#Model specialized in bio-inspired/mechanics and materials
model_name_1 = f"lamm-mit/Cephalo-Phi-3-vision-128k-4b-beta"
model_1 = AutoModelForCausalLM.from_pretrained(
    model_name_1,
    trust_remote_code=True,  torch_dtype=torch.bfloat16, 
    
).to(device)

#Original model
model_name_2 = f"microsoft/Phi-3-vision-128k-instruct"
model_2 = AutoModelForCausalLM.from_pretrained(
    model_name_2,
    trust_remote_code=True,  torch_dtype=torch.bfloat16, 
    
).to(device)

#Model trained on conversion of images to LaTeX formulas
model_name_3 = f"lamm-mit/Cephalo-LaTeX-Phi-3-vision-128k-4b-beta"
model_3 = AutoModelForCausalLM.from_pretrained(
    model_name_3,
    trust_remote_code=True,  torch_dtype=torch.bfloat16, 
    
).to(device)

dtype = torch.bfloat16  # Desired dtype for new layers in MoE model

# Initialize the models
base_model = copy.deepcopy(model_2)  # Your base model
expert_models = [model_1, model_2,  model_3  ]  # List of expert models
 
# Load a processor (e.g. from base model)
processor = AutoProcessor.from_pretrained(model_name_2, trust_remote_code=True) 

# Create the config
config =  AutoConfig.from_pretrained(model_name_2, trust_remote_code=True)

# Create the MoE model
moe_config = Phi3VForCausalLMMoEConfig(config=config, k=1, num_expert_models=len(expert_models))
moe_model = Phi3VForCausalLMMoE(moe_config, base_model, expert_models, layer_dtype=dtype).to(device)

count_parameters(expert_models[0]),count_parameters(moe_model)

Training the gating network

To train the gating network, you need to provide sample prompts for each expert. The sample prompts consist of text and image data. You must supply one set of prompts per expert, matching the number of experts defined above (num_expert_models).

To get the text data, you can use the processor/chat template:

messages = [ {"role": "user", "content": "<|image_1|>\nWhat is shown in this image, and what is the relevance for materials design?"}, ]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt

The following example shows how training of the gating layers is done. The training set consists of images and prompts. The first item in the list holds the prompts for expert 1, the second item the prompts for expert 2, and so on.

Example training set and training procedure (for simplicity, we use only three images, each characteristic of one expert):

from PIL import Image
import requests

image_1 = Image.open(requests.get("https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg", stream=True).raw) 
image_2 = Image.open(requests.get("https://images.pexels.com/photos/106399/pexels-photo-106399.jpeg", stream=True).raw) 
image_3 = Image.open(requests.get("https://upload.wikimedia.org/wikipedia/commons/a/a0/Euplectella_aspergillum_Okeanos.jpg", stream=True).raw) 

prompts_per_expert = [
    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 1<|end|>\n<|assistant|>\n", "image": [image_1]}, 
     {"text": "<|user|>\n<|image_1|>\nPrompt 2 for expert 1<|end|>\n<|assistant|>\n", "image": [image_1]}],

    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 2<|end|>\n<|assistant|>\n", "image": [image_2]}, 
     {"text": "<|user|>\n<|image_1|>\nPrompt 2 for expert 2<|end|>\n<|assistant|>\n", "image": [image_2]}],

    [{"text": "<|user|>\n<|image_1|>\nPrompt 1 for expert 3<|end|>\n<|assistant|>\n", "image": [image_3]}, 
     {"text": "<|user|>\n<|image_1|>\nPrompt 2 for expert 3<|end|>\n<|assistant|>\n", "image": [image_3]}],
]

# Train gating layers using the provided prompts
gating_layer_params = moe_model.train_gating_layer_params_from_hidden_states(processor, prompts_per_expert,
                                              epochs=1000,
                                              loss_steps=100,
                                              lr=5e-5,
                                          )

# Set parameters
moe_model.set_gating_layer_params(gating_layer_params)
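
Conceptually, train_gating_layer_params_from_hidden_states trains the gating layers as small classifiers over hidden states: each sample prompt is encoded, its hidden states are pooled, and the gating layer learns to predict which expert the prompt belongs to. The snippet below is a hypothetical, simplified sketch of that idea only; the pooled_hidden_states tensor, its shapes, and the single shared gating layer are illustrative assumptions, not the library code.

import torch
import torch.nn as nn

# Hypothetical pooled hidden states for the sample prompts, one row per prompt,
# and the index of the expert each prompt belongs to (shapes are assumptions).
pooled_hidden_states = torch.randn(6, 3072)        # 6 prompts, hidden size 3072
expert_labels = torch.tensor([0, 0, 1, 1, 2, 2])   # 2 prompts per expert, 3 experts

gating_layer = nn.Linear(3072, 3)                  # hidden size -> number of experts
optimizer = torch.optim.Adam(gating_layer.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1000):
    optimizer.zero_grad()
    logits = gating_layer(pooled_hidden_states)    # gating scores per expert
    loss = loss_fn(logits, expert_labels)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print(f"epoch {epoch + 1}: loss {loss.item():.4f}")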

Preparing the gating network for full training

To freeze all parameters in the model except the gating neural networks, you can use:

freeze_except_gating_layers(moe_model)
count_parameters(moe_model)

You can unfreeze all parameters with:

un_freeze_all(moe_model)
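
freeze_except_gating_layers and un_freeze_all are provided by the downloaded moe_phi3_v module. For intuition, utilities of this kind can be written by toggling requires_grad based on parameter names; the sketch below is illustrative only (the 'gate' name pattern is an assumption) and is not the library implementation:

def freeze_except_gating_layers_sketch(model, gating_keyword="gate"):
    """Freeze all parameters except those whose name contains the gating keyword."""
    for name, param in model.named_parameters():
        param.requires_grad = gating_keyword in name

def un_freeze_all_sketch(model):
    """Make all parameters trainable again."""
    for param in model.parameters():
        param.requires_grad = True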

Define FT_repo_id to push to the HF hub / save the model:

FT_repo_id='xxxxx/' #<repo_ID>
from datasets import load_dataset

train_dataset = load_dataset("lamm-mit/Cephalo-Wikipedia-Materials", split="train")
import random

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["image"]
            question = example["query"]
            answer = example["answer"]
            messages = [ {"role": "user", "content": '<|image_1|>\n'+question},
                         {"role": "assistant", "content": f"{answer}"}, ]

            text = self.processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)

            texts.append(text)
            images.append(image)

        # This collator assumes per_device_train_batch_size=1, i.e. one text/image pair per batch
        batch = self.processor(text=text, images=[image], return_tensors="pt", padding=True)

        # Use the input ids as labels; negative ids (image placeholder tokens) are masked from the loss
        labels = batch["input_ids"].clone()
        labels[labels < 0] = -100

        batch["labels"] = labels

        return batch

data_collator = MyDataCollator(processor)
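
As an optional sanity check (not part of the original recipe), you can run the collator on a single training example and inspect the shapes of the tensors it returns:

sample_batch = data_collator([train_dataset[0]])
for key, value in sample_batch.items():
    print(key, tuple(value.shape))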

Then set up the trainer, and train:

from transformers import TrainingArguments, Trainer

optim = "paged_adamw_8bit"

training_args = TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=250,
    learning_rate=1e-5,
    weight_decay=0.01,
    logging_steps=25,
    output_dir="output_training",
    optim=optim,
    save_strategy="steps",
    save_steps=1000,
    save_total_limit=16,
    #fp16=True,
    bf16=True,  
    push_to_hub_model_id=FT_repo_id,
    remove_unused_columns=False,
    report_to="none",
)

trainer = Trainer(
    model=moe_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

trainer.train()

Saving the model and pushing it to the Hugging Face Hub

You can push the MoE model to the Hugging Face hub, or save it locally, as follows:

merged_name='Cephalo-Phi-3-MoE-vision-128k-3x4b'
repo_id= '...'
processor.push_to_hub(repo_id+'/'+merged_name, safe_serialization=False)
moe_model.push_to_hub(repo_id+'/'+merged_name, safe_serialization=False)

To save locally:

merged_name='Cephalo-Phi-3-MoE-vision-128k-3x4b'
processor.save_pretrained(merged_name,safe_serialization=False)
moe_model.save_pretrained(merged_name,safe_serialization=False )

Further details on inference

Chat format

Given the nature of the training data, the Cephalo-Phi-3-vision-128k-4b-beta model is best suited for prompts that use a single image input with the following chat format. You can provide a single-image prompt using the general template below:

<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n 

where the model generates the text after <|assistant|>. For multi-turn conversations, the prompt should be formatted as follows:

<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n 
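
Equivalently, you can let the tokenizer's chat template build the multi-turn prompt for you rather than writing the tags by hand. In this short illustrative snippet, response_1 stands in for the model's answer from the first turn:

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "response_1"},  # model's answer from the first turn
    {"role": "user", "content": "How does this relate to materials design?"},
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)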

Example inference code

This code snippet shows how to get up and running quickly on a GPU:

import torch
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor, AutoConfig

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name_moe = "lamm-mit/Cephalo-Phi-3-MoE-vision-128k-3x4b-beta"

processor = AutoProcessor.from_pretrained(model_name_moe, trust_remote_code=True) 
moe_model = AutoModelForCausalLM.from_pretrained(
    model_name_moe,
    trust_remote_code=True,  torch_dtype=torch.bfloat16,    
).to(device)

question = "What is shown in this image, and what is the relevance for materials design? Include a discussion of multi-agent AI."

messages = [ 
    {"role": "user", "content": f"<|image_1|>\n{question}"}, 
    ] 

url = "https://d2r55xnwy6nx47.cloudfront.net/uploads/2018/02/Ants_Lede1300.jpg" 

image = Image.open(requests.get(url, stream=True).raw) 

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0") 

generation_args = { 
                    "max_new_tokens": 256, 
                    "temperature": 0.1, 
                    "do_sample": True, 
                    "stop_strings": ['<|end|>',
                                     '<|endoftext|>'],
                    "tokenizer": processor.tokenizer,
                  } 

generate_ids = moe_model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args) 

# remove input tokens 
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] 

print(response) 

Example output

[Input image of ants; image credit: Vaishakh Manohar]

The image shows a group of red ants (Solenopsis invicta) climbing over a vertical wooden post. The ants are using their long legs and antennae to navigate the rough surface of the wood, demonstrating their ability to adapt to different materials and environments. This behavior is relevant for materials design because it highlights the importance of considering the interactions between materials and living organisms, such as ants, when designing new materials.

Multi-agent AI (Artificial Intelligence) is a field of study that focuses on the development of AI systems that can work together with other AI systems to achieve a common goal. In the context of this image, multi-agent AI could be used to design materials that are more compatible with the natural behaviors of living organisms, such as ants, and that can adapt to different environments and conditions.

By studying the behavior of ants and other living organisms, researchers can gain insights into how materials can be designed to better interact with these organisms and to better mimic their natural behaviors. This can lead to the development of new materials that are more sustainable, efficient, and effective in a variety of applications.

In summary, the image of red ants climbing over a wooden post highlights the importance of considering the interactions between materials and living organisms when designing new materials, and the potential of multi-agent AI to help achieve this goal.

Dataset generation

The schematic below visualizes the dataset generation approach used to train the vision models. The extraction process uses advanced algorithms to accurately detect and separate images and their corresponding textual descriptions from complex PDF documents. It involves extracting images and captions from PDFs to create well-reasoned image-text pairs, utilizing large language models (LLMs) for natural language processing. These image-text pairs are then refined and validated through LLM-based NLP processing, ensuring high-quality, contextually relevant training data.
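
As a rough illustration of the first step of such a pipeline, figure/caption candidates can be pulled from a PDF with a library such as PyMuPDF. The sketch below is not the actual Cephalo extraction code (which uses more advanced detection plus LLM-based refinement and validation), and its caption heuristic is an assumption made purely for illustration:

import fitz  # PyMuPDF

def extract_image_caption_candidates(pdf_path):
    """Collect (image bytes, candidate caption) pairs from a PDF.
    Caption detection here is a crude heuristic: any text line starting with 'Figure' or 'Fig.'."""
    pairs = []
    doc = fitz.open(pdf_path)
    for page in doc:
        captions = [line.strip() for line in page.get_text().splitlines()
                    if line.strip().lower().startswith(("figure", "fig."))]
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            image_bytes = doc.extract_image(xref)["image"]
            caption = captions[img_index] if img_index < len(captions) else None
            pairs.append((image_bytes, caption))
    return pairs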

The figure below shows a reproduction of two pages from a scientific paper (here, Spivak, Buehler, et al., 2011) and how they are used to extract visual scientific data for training the Cephalo models. For more information on dataset generation and further details of the architecture, see https://arxiv.org/abs/2405.19076.

[Figure: Reproduction of two pages from Spivak, Buehler, et al., 2011, as used to extract visual scientific data for training Cephalo]

Citation

Please cite as:

@article{Buehler_Cephalo_2024,
  title={Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design},
  author={M.J. Buehler},
  journal={arXiv preprint arXiv:2405.19076},
  year={2024}
}
