A question about the model architecture
Hi, authors!
Thank you for your great work! Could you please clarify the model architecture of PathGen-LLaVA? According to the config file, the LLM is vicuna-13b-v1.5, the projector is a 2-layer MLP, and the vision encoder is clip-vit-large-patch14. Is that correct? If so, why doesn't the checkpoint repository contain a mm_projector.bin file?
Thank you for any help!
Hi! Thank you for your interest in our work.
To clarify:
Regarding the missing mm_projector.bin file: PathGen-LLaVA is a fully fine-tuned version of LLaVA in which we retrained the entire model, not just the MLP projector. Because training is end-to-end across all components, the projector weights are already merged into the overall model weights. This follows LLaVA's stage 1 and stage 2 training recipe, except that we retrain the whole model rather than fine-tuning only the projector.
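If it helps, here is a minimal sketch (not from the official repo) for checking that the projector weights live inside the full checkpoint; the repo id and index filename below are assumptions and depend on how the checkpoint is sharded:

```python
# Sketch: confirm the mm_projector weights are embedded in the full checkpoint
# rather than shipped as a separate mm_projector.bin.
import json
from huggingface_hub import hf_hub_download

repo_id = "path/to/PathGen-LLaVA"  # placeholder; use the actual checkpoint repo id

# For a sharded PyTorch checkpoint, the index file maps every weight name to its shard.
# If the checkpoint uses safetensors, the filename is "model.safetensors.index.json" instead.
index_path = hf_hub_download(repo_id, "pytorch_model.bin.index.json")
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Projector weights should show up under keys containing "mm_projector".
print([k for k in weight_map if "mm_projector" in k])
```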
Therefore, you can simply load our PathGen-LLaVA the same way you would load LLaVA.
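For example, assuming the standard LLaVA codebase is installed, loading could look like the sketch below; the model path is a placeholder for the actual PathGen-LLaVA checkpoint:

```python
# Sketch: load PathGen-LLaVA through the standard LLaVA loader.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "path/to/PathGen-LLaVA"  # placeholder for the real checkpoint path
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,  # full checkpoint, so no base model / LoRA merge is needed
    model_name=get_model_name_from_path(model_path),
)
```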
Additionally, for the vision encoder, while it follows the clip-vit-large-patch14 architecture, we have retrained this CLIP model. You need to replace the original vision encoder with the weights from PathGen-CLIP-L (https://huggingface.co/jamessyx/pathgenclip-vit-large-patch14-hf).
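One way to do the swap is sketched below, under the assumption that the checkpoint follows LLaVA's CLIPVisionTower layout (a `vision_tower` attribute wrapping a transformers CLIPVisionModel); the exact attribute names may differ in the released checkpoint:

```python
# Sketch: replace the vision tower weights with PathGen-CLIP-L.
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor

pathgen_clip = CLIPVisionModel.from_pretrained(
    "jamessyx/pathgenclip-vit-large-patch14-hf", torch_dtype=torch.float16
)

vision_tower = model.get_vision_tower()          # `model` from the loading step above
vision_tower.vision_tower = pathgen_clip.to(model.device)

# Keep image preprocessing consistent with the retrained CLIP as well.
image_processor = CLIPImageProcessor.from_pretrained(
    "jamessyx/pathgenclip-vit-large-patch14-hf"
)
```

Alternatively, if the checkpoint's config.json points to the vision encoder via mm_vision_tower, setting that field to the PathGen-CLIP-L repo id before loading should make the loader pull the retrained encoder directly; which route applies depends on how the released checkpoint is configured.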