---
library_name: transformers
base_model:
- TIGER-Lab/MAmmoTH2-8B
- rombodawg/rombos_Replete-Coder-Llama3-8B
tags:
- merge
---

Code + Math Llama3-8B merged with the RegMean algorithm. See details in https://github.com/AuCson/RegMean-LLama3-8B.

## Fast and Numerically Stable RegMean for Merging LLama3-8B

This repo is a fast and numerically stable re-implementation of the RegMean model merging algorithm for LLama3-8B.

We merge the following two models.

- [Code Model] [rombodawg/rombos_Replete-Coder-Llama3-8B](https://huggingface.co/rombodawg/rombos_Replete-Coder-Llama3-8B) (Re-implementation of Replete-Coder)
- [Math Model] [TIGER-Lab/MAmmoTH2-8B](https://huggingface.co/TIGER-Lab/MAmmoTH2-8B)

## Results

| Method/Benchmark | GSM8k (Math) | Mathqa (Math) | HumanEval-Instruct (Code) | MBPP (Code) |
| ---- | ---- | ---- | ---- | ---- |
| | 5-shot EM | 0-shot Acc-norm | 0-shot Pass@1 | 3-shot Pass@1 |
| [Math Model](https://huggingface.co/TIGER-Lab/MAmmoTH2-8B) | 70.40* | 43.85 | 36.59 | 40.04 |
| [Code Model](https://huggingface.co/rombodawg/rombos_Replete-Coder-Llama3-8B) | 57.92 | 37.35 | 42.07 | 49.20 |
| [Average](https://huggingface.co/aucson/llama3-code-math-avg-merge) | 65.27 | 44.05 | 43.29 | 47.80 |
| [RegMean ($\alpha$=0.1)](https://huggingface.co/aucson/llama3-code-math-regmean-merge/tree/main) | 68.31 | 44.99 | 44.51 | 45.20 |

\* Official result

\* We found that zero-shot results are sensitive to chat templates, so we report the best achievable result on HumanEval-Instruct for all models: we modified `lm-evaluation-harness/lm_eval/tasks/humaneval/humaneval.yaml` so that "\`\`\`" is treated as the end of a response.

The merged models, along with the activation inner product matrices, are available on the Hugging Face Hub.

## What's new?

RegMean solves a least-squares regression problem at each linear layer of the transformer. This is now implemented with PyTorch's built-in `torch.linalg.lstsq` function.
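For orientation, here is a minimal sketch of the per-layer problem on toy data. It is not this repo's merge code: the function name `regmean_merge_linear`, the tensor shapes, and the random inputs are illustrative assumptions. Weights are stored as `(in_features, out_features)`, i.e. transposed relative to `nn.Linear.weight`. The sketch also assumes that `alpha` multiplies the off-diagonal entries of each Gram matrix, as described in the RegMean paper; how the $\alpha$=0.1 setting reported above maps onto this convention is not stated here.

```python
import torch


def regmean_merge_linear(weights, grams, alpha=1.0):
    """Toy RegMean merge of one linear layer (illustrative sketch only).

    weights: list of (in_features, out_features) tensors, one per model.
    grams:   list of (in_features, in_features) Gram matrices X^T X.
    alpha:   ASSUMED off-diagonal scaling factor (1.0 leaves grams unchanged).
    """
    scaled = []
    for g in grams:
        diag = torch.diag(torch.diagonal(g))
        # Keep the diagonal as-is and multiply off-diagonal entries by alpha.
        scaled.append(alpha * g + (1.0 - alpha) * diag)

    sum_gram = sum(scaled)
    sum_gram_m_ws = sum(g @ w for g, w in zip(scaled, weights))

    # Solve sum_gram @ W = sum_gram_m_ws without forming an explicit inverse.
    return torch.linalg.lstsq(sum_gram, sum_gram_m_ws).solution


# Toy inputs: two "models" with 8 input features and 4 output features.
torch.manual_seed(0)
x1 = torch.randn(64, 8, dtype=torch.float64)
x2 = torch.randn(64, 8, dtype=torch.float64)
w1 = torch.randn(8, 4, dtype=torch.float64)
w2 = torch.randn(8, 4, dtype=torch.float64)
g1, g2 = x1.T @ x1, x2.T @ x2

merged = regmean_merge_linear([w1, w2], [g1, g2])

# Explicit-inverse formulation, shown only for comparison on this small case.
reference = torch.inverse(g1 + g2) @ (g1 @ w1 + g2 @ w2)
```

On a small, well-conditioned problem like this one the two formulations agree to numerical precision; the solver-based form is the one expected to behave better when the summed Gram matrix is large or ill-conditioned. The snippet that follows shows the corresponding change in this repo's merge code.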
```python
# old
# sum_gram_inv = torch.inverse(sum_gram)
# wt = torch.matmul(sum_gram_inv, sum_gram_m_ws)

# new
wt = torch.linalg.lstsq(sum_gram, sum_gram_m_ws).solution
```

According to PyTorch's official documentation,

```
This function computes X = A.pinverse() @ B in a faster and more numerically stable way than performing the computations separately.
```

## Computational efficiency

- **Computing gram matrices**: We compute inner product matrices for the code and math models on 10k training examples. Each takes 3 hours on one Quadro RTX A6000 GPU (this could probably be accelerated with a more efficient LLM inference framework). We have uploaded them under the [merged model repo](https://huggingface.co/aucson/llama3-code-math-regmean-merge/tree/main) so that you do not need to recompute them.
- **Merging Models**: ~2 minutes on the same GPU for this re-implementation. Please note that loading two 8B models and (almost) equally sized inner product matrices at once can take >150GB of memory.

## Reproducing the results

1. Create a Python environment and install the modified lm-eval-harness library for evaluating merged models.

```
cd lm-eval-harness
pip install -e .
```

The only modification is `lm_eval/tasks/humaneval/humaneval.yaml`.

2. Prepare the activation inner product matrices. You can download them from the [merged model repo](https://huggingface.co/aucson/llama3-code-math-regmean-merge/tree/main) and place them under `runs/merges/math-llama3/gram.pkl` and `runs/merges/code-llama3/gram.pkl`.

Alternatively, you can compute them yourself with:

```
python compute_gram.py code
python compute_gram.py math
```

3. Merge the models.

```
python merge_model.py avg
python merge_model.py regmean
```

4. Evaluate with `lm-eval-harness`. Please follow the safety guidelines of HumanEval and MBPP regarding the execution of LLM-generated code.
```
merge_exp=regmean_0.1
# merge_exp=avg
HF_ALLOW_CODE_EVAL=1 lm_eval --model vllm --model_args pretrained=runs/merges/${merge_exp},tokenizer=meta-llama/Meta-Llama-3-8B,tensor_parallel_size=1,dtype=bfloat16 --tasks mathqa,gsm8k,humaneval_instruct,mbpp --output_path runs/merges/${merge_exp}/lm_eval_results_preds --log_samples --trust_remote_code --confirm_run_unsafe_code
```

## Caveats

Overall, simple averaging works well for LLMs, and the benefits of merging algorithms diminish [1].

## Citations

For the RegMean algorithm.

```
@inproceedings{
jin2023dataless,
title={Dataless Knowledge Fusion by Merging Weights of Language Models},
author={Xisen Jin and Xiang Ren and Daniel Preotiuc-Pietro and Pengxiang Cheng},
booktitle={The Eleventh International Conference on Learning Representations },
year={2023},
url={https://openreview.net/forum?id=FCnohuR6AnM}
}
```

Here are other useful references that greatly inspired this re-implementation.

[1] Yadav et al. 2024, [What Matters for Model Merging at Scale?](https://arxiv.org/abs/2410.03617)

[2] Tam et al. 2024, [Merging by Matching Models in Task Parameter Subspaces](https://openreview.net/forum?id=qNGo6ghWFB)