One confusion about the model architecture

#2
by Liusn - opened

Hi, authors!

Thank you for your great work! Could you please clarify the model architecture of PathGen-LLaVA? According to the config file, the LLM is vicuna-13b-v1.5, the projector is a 2-layer MLP, and the vision encoder is clip-vit-large-patch14 — is that correct? If so, why doesn't the checkpoint repository contain a mm_projector.bin file?

Thank you for any help!

Hi! Thank you for your interest in our work.

To clarify:

Regarding the missing mm_projector.bin file: PathGen-LLaVA is a fully fine-tuned version of LLaVA in which we retrained the entire model, not just the MLP projector. Because we performed end-to-end training on all components, the projector weights are already merged into the main model checkpoint. This is consistent with LLaVA's stage-1 and stage-2 training approach, except that we comprehensively retrain the model rather than fine-tuning only the projector.
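For reference, here is a quick sketch of how you could confirm that the projector parameters are inside the merged checkpoint shards. The checkpoint path is a placeholder, and the key names assume LLaVA's usual naming convention:

```python
# Sketch: list the mm_projector parameters contained in the merged checkpoint.
# "path/to/PathGen-LLaVA" is a placeholder for your local checkpoint directory.
import glob, os
import torch
from safetensors import safe_open

ckpt_dir = "path/to/PathGen-LLaVA"

projector_keys = []

# Safetensors shards
for shard in glob.glob(os.path.join(ckpt_dir, "*.safetensors")):
    with safe_open(shard, framework="pt", device="cpu") as f:
        projector_keys += [k for k in f.keys() if "mm_projector" in k]

# Legacy .bin shards, in case the checkpoint is stored that way instead
for shard in glob.glob(os.path.join(ckpt_dir, "pytorch_model*.bin")):
    state = torch.load(shard, map_location="cpu")
    projector_keys += [k for k in state if "mm_projector" in k]

print(projector_keys)
# Expect keys such as model.mm_projector.0.weight and model.mm_projector.2.weight
# (the two Linear layers of the 2-layer MLP), assuming LLaVA's standard key naming.
```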

Therefore, you can simply load our PathGen-LLaVA the same way you would load LLaVA.
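A minimal loading sketch, assuming the official LLaVA codebase (the llava package) is installed; the model path below is a placeholder for the checkpoint directory or repo id:

```python
# Sketch: load PathGen-LLaVA with the standard LLaVA loader.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "path/to/PathGen-LLaVA"  # placeholder

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,  # fully fine-tuned checkpoint, so no separate base model is needed
    model_name=get_model_name_from_path(model_path),
)
```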

Additionally, while the vision encoder follows the clip-vit-large-patch14 architecture, we have retrained this CLIP model. You need to replace the original vision encoder weights with those of PathGen-CLIP-L (https://huggingface.co/jamessyx/pathgenclip-vit-large-patch14-hf).
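One possible way to do the swap, as a sketch only: either point the mm_vision_tower entry in the checkpoint's config.json at the PathGen-CLIP-L repo before loading, or replace the weights after loading as below. The attribute path (get_vision_tower().vision_tower) follows the upstream LLaVA code; adjust it if your fork differs, and the image-processor line assumes the repo ships a preprocessor config.

```python
# Sketch: replace the loaded vision tower with PathGen-CLIP-L weights.
# Assumes `model` is the LLaVA-style model from the loading step above.
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

PATHGEN_CLIP = "jamessyx/pathgenclip-vit-large-patch14-hf"

vision_tower = model.get_vision_tower()
vision_tower.vision_tower = CLIPVisionModel.from_pretrained(
    PATHGEN_CLIP, torch_dtype=torch.float16
).to(model.device)

# Use the matching preprocessing for input images.
image_processor = CLIPImageProcessor.from_pretrained(PATHGEN_CLIP)
```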
