The released final ckpt and stage2-ingredient1-step23852-tokens51B ckpt have different eval results

by wydwww - opened May 28, 2025

May 28, 2025

•

edited May 28, 2025

As mentioned in #1, the released final checkpoint corresponds to ingredient 1, stage2-ingredient1-step23852-tokens51B. I used lm-evaluation-harness to evaluate both allenai/OLMo-2-0425-1B (main) and the stage2-ingredient1-step23852-tokens51B revision, and they give different results on MMLU and gsm8k.

Can you please clarify why the released ckpt has lower evaluation results? Thanks.

MMLU:
released final:

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.4257|±  |0.0041|
| - humanities     |      2|none  |      |acc   |↑  |0.3947|±  |0.0069|
| - other          |      2|none  |      |acc   |↑  |0.4870|±  |0.0088|
| - social sciences|      2|none  |      |acc   |↑  |0.4807|±  |0.0089|
| - stem           |      2|none  |      |acc   |↑  |0.3578|±  |0.0084|

stage2-ingredient1-step23852-tokens51B:

|      Groups      |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|------|------|---|-----:|---|-----:|
|mmlu              |      2|none  |      |acc   |↑  |0.4417|±  |0.0041|
| - humanities     |      2|none  |      |acc   |↑  |0.4136|±  |0.0069|
| - other          |      2|none  |      |acc   |↑  |0.4957|±  |0.0088|
| - social sciences|      2|none  |      |acc   |↑  |0.5018|±  |0.0088|
| - stem           |      2|none  |      |acc   |↑  |0.3717|±  |0.0085|

gsm8k:
released final:

hf (pretrained=allenai/OLMo-2-0425-1B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     4|exact_match|↑  |0.4079|±  |0.0135|
|         |       |strict-match    |     4|exact_match|↑  |0.4003|±  |0.0135|

stage2-ingredient1-step23852-tokens51B:

hf (pretrained=allenai/OLMo-2-0425-1B,revision=stage2-ingredient1-step23852-tokens51B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     4|exact_match|↑  |0.4594|±  |0.0137|
|         |       |strict-match    |     4|exact_match|↑  |0.4223|±  |0.0136|

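I use the same evaluation settings for both runs (the revision argument shown in parentheses is included only for the ingredient checkpoint):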

lm_eval --model hf \
    --model_args pretrained=allenai/OLMo-2-0425-1B(,revision=stage2-ingredient1-step23852-tokens51B) \
    --tasks gsm8k_cot \
    --batch_size auto \
    --num_fewshot 4 \
    --trust_remote_code \
    --confirm_run_unsafe_code
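
For anyone reproducing this, here is a minimal sketch that builds the two command lines above so the only difference between runs is the optional `revision` argument. The helper name `build_lm_eval_command` is hypothetical (it is not part of lm-evaluation-harness); the flags are copied from the command above.

```python
import shlex

REPO = "allenai/OLMo-2-0425-1B"


def build_lm_eval_command(repo, revision=None, task="gsm8k_cot", num_fewshot=4):
    """Return the lm_eval argument list; `revision` is appended only when given.

    Hypothetical helper for illustration, mirroring the flags used in this thread.
    """
    model_args = f"pretrained={repo}"
    if revision is not None:
        model_args += f",revision={revision}"
    return [
        "lm_eval", "--model", "hf",
        "--model_args", model_args,
        "--tasks", task,
        "--batch_size", "auto",
        "--num_fewshot", str(num_fewshot),
        "--trust_remote_code",
        "--confirm_run_unsafe_code",
    ]


if __name__ == "__main__":
    # Print both commands; pass the list to subprocess.run(...) to execute them.
    for rev in (None, "stage2-ingredient1-step23852-tokens51B"):
        print(shlex.join(build_lm_eval_command(REPO, rev)))
```

Building the argument list in one place makes it harder for the two runs to drift apart in few-shot count or task name.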

Also, the description in the allenai/OLMo repo claims that the released main ckpt is merged from a soup, which differs from the description on the HF model page and in #1.

amanrangapur

May 28, 2025

Hey @wydwww, thanks for raising this issue. I have cross-verified this with the team again.

There is no model souping (there was a typo in the README of the OLMo GitHub repo, which I have fixed).
I was wrong in my comment in #1. Ingredient 3 is seed 42, and it is the final main checkpoint. Ingredients 1 and 2 are not; they are just exploratory anneals. I have addressed this in #1.
To clear things up, I have updated the README.

Sorry for the inconvenience. You can retry the evals.

amanrangapur changed discussion status to closed May 28, 2025

wydwww

May 29, 2025

•

edited May 29, 2025

Thanks for your reply, @amanrangapur. I ran the gsm8k eval of stage2-ingredient3-step23852-tokens51B with the same command, and still got a significantly higher flexible-extract result (0.4549) than the main ckpt (0.4079). FYI, the ingredient 2 ckpt has a 0.4556 score in this setting. Did you use any post-processing to get the final ckpt?

hf (pretrained=allenai/OLMo-2-0425-1B,revision=stage2-ingredient3-step23852-tokens51B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 4, batch_size: auto
|  Tasks  |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot|      3|flexible-extract|     4|exact_match|↑  |0.4549|±  |0.0137|
|         |       |strict-match    |     4|exact_match|↑  |0.4511|±  |0.0137|
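
As a rough check on how large this gap is relative to the reported standard errors, the sketch below computes an unpaired z-score from two (accuracy, stderr) pairs taken from the tables in this thread. It assumes the two estimates are independent, which is only an approximation here because both runs score the same test questions; a paired test on per-example results would be more appropriate if those are available. The function names are illustrative.

```python
import math


def z_score(acc_a, se_a, acc_b, se_b):
    """Unpaired z-score for the difference between two accuracy estimates."""
    return (acc_a - acc_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# gsm8k_cot flexible-extract: ingredient 3 vs. main, values from the tables above.
z = z_score(0.4549, 0.0137, 0.4079, 0.0135)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.4f}")
```

With these inputs the difference is a bit more than two combined standard errors, so it is unlikely to be sampling noise alone under the stated assumptions.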

amanrangapur

May 29, 2025

Hey @wydwww, we did not use any post-processing on the final checkpoint. We selected one of the ingredients (anneals) based on the average eval scores.

wydwww

May 29, 2025

@amanrangapur It seems that the final ckpt does not match any of the 3 ingredient ckpts. Do you have any thoughts on this? Can you please verify that the main and stage2-ingredient3-step23852-tokens51B ckpts are the same in your setting? Thanks.

yukiwuki

Mar 23

•

edited Mar 23

Hey @wydwww , it's been a while -- but did you manage to figure out why they are different?
