Example notebook fails on run. Missing argument "fcstep"

#13

by hansbrenna - opened 28 days ago

28 days ago

Hi!

Thank you for releasing this to the community. I'm trying to get started running this model using your example notebook on our Azure Databricks infrastructure. I was able to run AIFS Single without any problems. When trying to run the checkpoint for AIFS ENS as in your notebook it fails with the following error message:

for state in runner.run(input_state=input_state, lead_time=12,):
print_state(state)

TypeError: AnemoiEnsModelEncProcDec.forward() missing 1 required keyword-only argument: 'fcstep'

How do I fix this?

hcookie129

ECMWF org 23 days ago

The environments for AIFS-Single and AIFS-Ens are different, particularly on anemoi-models. Can you confirm if you have the correct versions?

hansbrenna

23 days ago

Hi! Thank you for answering.

I did not use the same environment for both models. I started from the example notebook for aifs-ens (https://huggingface.co/ecmwf/aifs-ens-1.0/blob/main/run_AIFS_ENS_v1.ipynb). I installed the environment as described in the first cell there.

hcookie129

ECMWF org 23 days ago

In that case, can you please provide the full stack trace and the output of anemoi-inference validate?

hansbrenna

22 days ago

•

edited 19 days ago by

hcookie129

I'm running on Python 3.11.11

I'm unable to run anemoi-inference validate on our system. It fails like this:

cp.validate_environment()
NoSuchPathError: /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/__init__.py

This is the full stack trace:

TypeError: AnemoiEnsModelEncProcDec.forward() missing 1 required keyword-only argument: 'fcstep'
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:482, in Runner.predict_step(self, model, input_tensor_torch, **kwargs)
    481 try:
--> 482     return model.predict_step(input_tensor_torch, **kwargs)
    483 except TypeError:
    484     # This is for backward compatibility because old models did not
    485     # have kwargs in the forward or predict_step
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/interface/__init__.py:129, in AnemoiModelInterface.predict_step(self, batch, model_comm_group, **kwargs)
    127     x = batch[:, 0 : self.multi_step, None, ...]  # add dummy ensemble dimension as 3rd index
--> 129     y_hat = self(x, model_comm_group=model_comm_group, **kwargs)
    131 return self.post_processors(y_hat, in_place=False)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/models/ens_encoder_processor_decoder.py:150, in AnemoiEnsModelEncProcDec.forward(self, x, fcstep, model_comm_group, **kwargs)
    148 processor_kwargs = {"cond": latent_noise} if latent_noise is not None else {}
--> 150 x_latent_proc = self.processor(
    151     x=x_latent_proc,
    152     batch_size=bse,
    153     shard_shapes=shard_shapes_hidden,
    154     model_comm_group=model_comm_group,
    155     **processor_kwargs,
    156 )
    158 x_latent_proc = x_latent_proc + x_latent
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/processor.py:183, in TransformerProcessor.forward(self, x, batch_size, shard_shapes, model_comm_group, *args, **kwargs)
    179     assert (
    180         model_comm_group.size() == 1 or batch_size == 1
    181     ), "Only batch size of 1 is supported when model is sharded accross GPUs"
--> 183 (x,) = self.run_layers((x,), shape_nodes, batch_size, model_comm_group, **kwargs)
    185 return x
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/processor.py:77, in BaseProcessor.run_layers(self, data, *args, **kwargs)
     76 for layer in self.proc:
---> 77     data = checkpoint(layer, *data, *args, **kwargs, use_reentrant=False)
     78 return data
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/_compile.py:32, in _disable_dynamo.<locals>.inner(*args, **kwargs)
     30     fn.__dynamo_disable = disable_fn
---> 32 return disable_fn(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:632, in DisableContext.__call__.<locals>._fn(*args, **kwargs)
    631 try:
--> 632     return fn(*args, **kwargs)
    633 finally:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/utils/checkpoint.py:496, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs)
    495 next(gen)
--> 496 ret = function(*args, **kwargs)
    497 # Runs post-forward logic
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/chunk.py:147, in TransformerProcessorChunk.forward(self, x, shapes, batch_size, model_comm_group, **kwargs)
    146 for i in range(self.num_layers):
--> 147     x = self.blocks[i](x, shapes, batch_size, model_comm_group=model_comm_group, **kwargs)
    149 return (x,)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/block.py:122, in TransformerProcessorBlock.forward(self, x, shapes, batch_size, model_comm_group, **layer_kwargs)
    114 def forward(
    115     self,
    116     x: Tensor,
   (...)
    120     **layer_kwargs,
    121 ) -> Tensor:
--> 122     x = x + self.attention(
    123         self.layer_norm_attention(x, **layer_kwargs), shapes, batch_size, model_comm_group=model_comm_group
    124     )
    125     x = x + self.mlp(
    126         self.layer_norm_mlp(
    127             x,
    128             **layer_kwargs,
    129         )
    130     )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/attention.py:165, in MultiHeadSelfAttention.forward(self, x, shapes, batch_size, model_comm_group)
    163     key = self.k_norm(key)
--> 165 out = self.attention(
    166     query,
    167     key,
    168     value,
    169     batch_size,
    170     causal=False,
    171     window_size=self.window_size,
    172     dropout_p=dropout_p,
    173     softcap=self.softcap,
    174     alibi_slopes=self.alibi_slopes,
    175 )
    177 out = shard_sequence(out, shapes=shapes, mgroup=model_comm_group)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/attention.py:278, in FlashAttentionWrapper.forward(self, query, key, value, batch_size, causal, window_size, dropout_p, softcap, alibi_slopes)
    276 alibi_slopes = alibi_slopes.repeat(batch_size, 1).to(query.device) if alibi_slopes is not None else None
--> 278 out = self.attention(
    279     query,
    280     key,
    281     value,
    282     causal=False,
    283     window_size=(window_size, window_size),
    284     dropout_p=dropout_p,
    285     softcap=softcap,
    286     alibi_slopes=alibi_slopes,
    287 )
    288 out = einops.rearrange(out, "batch grid heads vars -> batch heads grid vars")
TypeError: flash_attn_func() got an unexpected keyword argument 'softcap'

During handling of the above exception, another exception occurred:
TypeError                                 Traceback (most recent call last)
File <command-4551805964158823>, line 1
----> 1 for state in runner.run(input_state=input_state, lead_time=6,):
      2     print_state(state)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:222, in Runner.run(self, input_state, lead_time)
    219     input_tensor = self.prepare_input_tensor(input_state)
    221 try:
--> 222     yield from self.forecast(lead_time, input_tensor, input_state)
    223 except (TypeError, ModuleNotFoundError, AttributeError):
    224     if self.report_error:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:590, in Runner.forecast(self, lead_time, input_tensor_numpy, input_state)
    584 # Predict next state of atmosphere
    585 with (
    586     torch.autocast(device_type=self.device, dtype=self.autocast),
    587     ProfilingLabel("Predict step", self.use_profiler),
    588     Timer(title),
    589 ):
--> 590     y_pred = self.predict_step(self.model, input_tensor_torch, fcstep=s, step=step, date=date)
    592 # Detach tensor and squeeze (should we detach here?)
    593 with ProfilingLabel("Sending output to cpu", self.use_profiler):
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:486, in Runner.predict_step(self, model, input_tensor_torch, **kwargs)
    482     return model.predict_step(input_tensor_torch, **kwargs)
    483 except TypeError:
    484     # This is for backward compatibility because old models did not
    485     # have kwargs in the forward or predict_step
--> 486     return model.predict_step(input_tensor_torch)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/interface/__init__.py:129, in AnemoiModelInterface.predict_step(self, batch, model_comm_group, **kwargs)
    125     # Dimensions are
    126     # batch, timesteps, horizonal space, variables
    127     x = batch[:, 0 : self.multi_step, None, ...]  # add dummy ensemble dimension as 3rd index
--> 129     y_hat = self(x, model_comm_group=model_comm_group, **kwargs)
    131 return self.post_processors(y_hat, in_place=False)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

hcookie129

ECMWF org 19 days ago

Something looks particularly funky with your environment if validate fails. Please remove and rebuild your environment with the versions from the notebook.
Afterwards, with a downloaded copy of the checkpoint, run
anemoi-inference validate CKPT_PATH_HERE

Btw, the real issue seems to be within the flash_attn wrapper implementation, which has changed slightly in recent versions
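
The traceback above suggests why the reported error is misleading: Runner.predict_step catches any TypeError and retries without kwargs, so the 'softcap' TypeError raised inside the attention wrapper is replaced by the 'fcstep' one. A minimal sketch of that masking effect, using stand-in functions (every name below is hypothetical and not the anemoi or flash_attn API):

```python
def attention(query, **kwargs):
    # Stand-in for an older attention function that does not know 'softcap'.
    if "softcap" in kwargs:
        raise TypeError("flash_attn_func() got an unexpected keyword argument 'softcap'")
    return query


def forward(x, *, fcstep):
    # Stand-in for a model forward that requires the keyword-only 'fcstep'.
    return attention(x, softcap=0.0)


def predict_step(x, **kwargs):
    # Mirrors the try/except TypeError fallback visible in the traceback.
    try:
        return forward(x, **kwargs)
    except TypeError:
        # Retry without kwargs: this now fails because 'fcstep' is missing.
        return forward(x)


def surfaced_errors(x):
    """Return (message the user sees, message of the original cause)."""
    try:
        predict_step(x, fcstep=0)
    except TypeError as err:
        # __context__ holds the exception that was being handled when
        # the second one was raised.
        return str(err), str(err.__context__)
    return None, None


seen, cause = surfaced_errors([1.0, 2.0])
print("seen: ", seen)
print("cause:", cause)
```

The first line printed is the 'fcstep' complaint; the second is the underlying 'softcap' error, which is the one worth fixing.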

hansbrenna

15 days ago

•

edited 15 days ago

From further investigation, the failure to validate seems to be caused by the way Databricks manages Python environments. For reference, I get the same error on environment validation in the AIFS-single environment, which then runs the model successfully. Environment validation through the command-line tool seems to work, though.

%sh anemoi-inference validate /Volumes/cdp_dev_sandbox_catalog_01/weather_enriched/enriched/aifs-ens-crps-1.0.ckpt

2025-10-14 11:19:26 WARNING Environment validation failed. The following issues were found:
  python:
    Python version mismatch: 3.11.6 != 3.11.11
  mismatch:
    Version of module anemoi.utils was lower in training than in inference: 0.4.22 <= 0.4.37
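
To compare the installed environment with the versions pinned in the notebook, the installed versions can also be listed from inside a notebook cell. This is a small sketch using only the standard library; the list of distribution names is an assumption about what is relevant here and can be adjusted:

```python
from importlib import metadata

# Distribution names assumed to be relevant to this thread; adjust as needed.
PACKAGES = ["anemoi-inference", "anemoi-models", "anemoi-utils", "flash-attn", "torch"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```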

Do you know what I should do about the flash_attn wrapper problem?

hansbrenna

14 days ago

•

edited 14 days ago

I spent a bit of time yesterday trying to understand what happens inside runner.run. Building the input tensor and manually calling the model predict step method with fcstep as a kwarg produced the following error:

TypeError: flash_attn_func() got an unexpected keyword argument 'softcap'

hansbrenna changed discussion status to closed 2 days ago

hansbrenna

2 days ago

•

edited 2 days ago

Hello again. Do you have any further information about how to get around the fcstep issue?

hansbrenna changed discussion status to open 2 days ago

hcookie129

ECMWF org 1 day ago

I'm really not sure. We have only seen this issue with incorrect environments; this should not be happening if everything is installed correctly.
Which flash_attn version do you have?

hansbrenna

about 23 hours ago

The version I've been having problems with is
flash-attn 2.5.9.post1
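
One way to confirm whether the installed flash_attn_func accepts softcap is to inspect its signature. To my knowledge, softcap support arrived around the flash-attn 2.6 releases, which would be consistent with 2.5.9.post1 rejecting it, but treat that as an assumption to verify against the flash-attn changelog. In the sketch below, the helper names and the two stand-in functions are hypothetical; only inspect.signature and the flash_attn import are real:

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind is not inspect.Parameter.POSITIONAL_ONLY:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-ins for an older and a newer attention signature (hypothetical).
def old_attention(q, k, v, dropout_p=0.0, causal=False):
    return q


def new_attention(q, k, v, dropout_p=0.0, causal=False, softcap=0.0):
    return q


def flash_attn_supports_softcap():
    """Return True/False for the installed flash_attn, or None if it cannot be imported."""
    try:
        from flash_attn import flash_attn_func
    except ImportError:
        return None
    return accepts_keyword(flash_attn_func, "softcap")


print("old stand-in:", accepts_keyword(old_attention, "softcap"))
print("new stand-in:", accepts_keyword(new_attention, "softcap"))
print("installed flash_attn:", flash_attn_supports_softcap())
```

If the last line prints False, the installed flash-attn is older than what the anemoi-models attention wrapper expects, and installing the flash-attn version given in the notebook's first cell is the likely fix.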
