Example notebook fails on run. Missing argument "fcstep"
Hi!
Thank you for releasing this to the community. I'm trying to get started running this model using your example notebook on our Azure Databricks infrastructure. I was able to run AIFS Single without any problems. When trying to run the checkpoint for AIFS ENS as in your notebook it fails with the following error message:
for state in runner.run(input_state=input_state, lead_time=12,):
print_state(state)
TypeError: AnemoiEnsModelEncProcDec.forward() missing 1 required keyword-only argument: 'fcstep'
How do I fix this?
The environments for AIFS-Single and AIFS-Ens are different, particularly the anemoi-models version. Can you confirm that you have the correct versions installed?
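One quick way to check is to print the installed versions of the packages involved and compare them with the pins in the notebook's first cell. This is a minimal sketch using only the standard library. The list of distribution names is an assumption based on the packages mentioned in this thread, so adjust it to match the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed from this thread; extend or trim as needed.
PACKAGES = ["anemoi-inference", "anemoi-models", "anemoi-utils", "flash-attn", "torch"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this in the same notebook session that loads the checkpoint matters, since that is the environment the model actually sees.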
Hi! Thank you for answering.
I did not use the same environment for both models. I started from the example notebook for aifs-ens (https://huggingface.co/ecmwf/aifs-ens-1.0/blob/main/run_AIFS_ENS_v1.ipynb). I installed the environment as described in the first cell there.
In that case, can you please provide the full stack trace, and the output of anemoi-inference validate?
I'm running on Python 3.11.11
I'm unable to run anemoi-inference validate on our system. It fails like this:
cp.validate_environment()
NoSuchPathError: /databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/__init__.py
This is the full stack trace:
TypeError: AnemoiEnsModelEncProcDec.forward() missing 1 required keyword-only argument: 'fcstep'
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:482, in Runner.predict_step(self, model, input_tensor_torch, **kwargs)
481 try:
--> 482 return model.predict_step(input_tensor_torch, **kwargs)
483 except TypeError:
484 # This is for backward compatibility because old models did not
485 # have kwargs in the forward or predict_step
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/interface/__init__.py:129, in AnemoiModelInterface.predict_step(self, batch, model_comm_group, **kwargs)
127 x = batch[:, 0 : self.multi_step, None, ...] # add dummy ensemble dimension as 3rd index
--> 129 y_hat = self(x, model_comm_group=model_comm_group, **kwargs)
131 return self.post_processors(y_hat, in_place=False)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/models/ens_encoder_processor_decoder.py:150, in AnemoiEnsModelEncProcDec.forward(self, x, fcstep, model_comm_group, **kwargs)
148 processor_kwargs = {"cond": latent_noise} if latent_noise is not None else {}
--> 150 x_latent_proc = self.processor(
151 x=x_latent_proc,
152 batch_size=bse,
153 shard_shapes=shard_shapes_hidden,
154 model_comm_group=model_comm_group,
155 **processor_kwargs,
156 )
158 x_latent_proc = x_latent_proc + x_latent
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/processor.py:183, in TransformerProcessor.forward(self, x, batch_size, shard_shapes, model_comm_group, *args, **kwargs)
179 assert (
180 model_comm_group.size() == 1 or batch_size == 1
181 ), "Only batch size of 1 is supported when model is sharded accross GPUs"
--> 183 (x,) = self.run_layers((x,), shape_nodes, batch_size, model_comm_group, **kwargs)
185 return x
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/processor.py:77, in BaseProcessor.run_layers(self, data, *args, **kwargs)
76 for layer in self.proc:
---> 77 data = checkpoint(layer, *data, *args, **kwargs, use_reentrant=False)
78 return data
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/_compile.py:32, in _disable_dynamo.<locals>.inner(*args, **kwargs)
30 fn.__dynamo_disable = disable_fn
---> 32 return disable_fn(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py:632, in DisableContext.__call__.<locals>._fn(*args, **kwargs)
631 try:
--> 632 return fn(*args, **kwargs)
633 finally:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/utils/checkpoint.py:496, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs)
495 next(gen)
--> 496 ret = function(*args, **kwargs)
497 # Runs post-forward logic
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/chunk.py:147, in TransformerProcessorChunk.forward(self, x, shapes, batch_size, model_comm_group, **kwargs)
146 for i in range(self.num_layers):
--> 147 x = self.blocks[i](x, shapes, batch_size, model_comm_group=model_comm_group, **kwargs)
149 return (x,)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/block.py:122, in TransformerProcessorBlock.forward(self, x, shapes, batch_size, model_comm_group, **layer_kwargs)
114 def forward(
115 self,
116 x: Tensor,
(...)
120 **layer_kwargs,
121 ) -> Tensor:
--> 122 x = x + self.attention(
123 self.layer_norm_attention(x, **layer_kwargs), shapes, batch_size, model_comm_group=model_comm_group
124 )
125 x = x + self.mlp(
126 self.layer_norm_mlp(
127 x,
128 **layer_kwargs,
129 )
130 )
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/attention.py:165, in MultiHeadSelfAttention.forward(self, x, shapes, batch_size, model_comm_group)
163 key = self.k_norm(key)
--> 165 out = self.attention(
166 query,
167 key,
168 value,
169 batch_size,
170 causal=False,
171 window_size=self.window_size,
172 dropout_p=dropout_p,
173 softcap=self.softcap,
174 alibi_slopes=self.alibi_slopes,
175 )
177 out = shard_sequence(out, shapes=shapes, mgroup=model_comm_group)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/layers/attention.py:278, in FlashAttentionWrapper.forward(self, query, key, value, batch_size, causal, window_size, dropout_p, softcap, alibi_slopes)
276 alibi_slopes = alibi_slopes.repeat(batch_size, 1).to(query.device) if alibi_slopes is not None else None
--> 278 out = self.attention(
279 query,
280 key,
281 value,
282 causal=False,
283 window_size=(window_size, window_size),
284 dropout_p=dropout_p,
285 softcap=softcap,
286 alibi_slopes=alibi_slopes,
287 )
288 out = einops.rearrange(out, "batch grid heads vars -> batch heads grid vars")
TypeError: flash_attn_func() got an unexpected keyword argument 'softcap'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
File <command-4551805964158823>, line 1
----> 1 for state in runner.run(input_state=input_state, lead_time=6,):
2 print_state(state)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:222, in Runner.run(self, input_state, lead_time)
219 input_tensor = self.prepare_input_tensor(input_state)
221 try:
--> 222 yield from self.forecast(lead_time, input_tensor, input_state)
223 except (TypeError, ModuleNotFoundError, AttributeError):
224 if self.report_error:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:590, in Runner.forecast(self, lead_time, input_tensor_numpy, input_state)
584 # Predict next state of atmosphere
585 with (
586 torch.autocast(device_type=self.device, dtype=self.autocast),
587 ProfilingLabel("Predict step", self.use_profiler),
588 Timer(title),
589 ):
--> 590 y_pred = self.predict_step(self.model, input_tensor_torch, fcstep=s, step=step, date=date)
592 # Detach tensor and squeeze (should we detach here?)
593 with ProfilingLabel("Sending output to cpu", self.use_profiler):
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/inference/runner.py:486, in Runner.predict_step(self, model, input_tensor_torch, **kwargs)
482 return model.predict_step(input_tensor_torch, **kwargs)
483 except TypeError:
484 # This is for backward compatibility because old models did not
485 # have kwargs in the forward or predict_step
--> 486 return model.predict_step(input_tensor_torch)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/anemoi/models/interface/__init__.py:129, in AnemoiModelInterface.predict_step(self, batch, model_comm_group, **kwargs)
125 # Dimensions are
126 # batch, timesteps, horizonal space, variables
127 x = batch[:, 0 : self.multi_step, None, ...] # add dummy ensemble dimension as 3rd index
--> 129 y_hat = self(x, model_comm_group=model_comm_group, **kwargs)
131 return self.post_processors(y_hat, in_place=False)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
1734 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1735 else:
-> 1736 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-14c2af70-93aa-4dfc-a7b4-d21e67db127a/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
1742 # If we don't have any hooks, we want to skip the rest of the logic in
1743 # this function, and just call forward.
1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1745 or _global_backward_pre_hooks or _global_backward_hooks
1746 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747 return forward_call(*args, **kwargs)
1749 result = None
1750 called_always_called_hooks = set()
Something looks particularly funky with your environment if the validation fails. Please remove and rebuild your environment with the versions from the notebook.
Afterwards, with a downloaded copy of the checkpoint, run anemoi-inference validate CKPT_PATH_HERE
Btw, the real issue seems to be within the flash_attn wrapper implementation, which has changed slightly in recent versions
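The traceback above shows why the reported error mentions fcstep even though the root cause is elsewhere: the runner catches the first TypeError and retries the call without keyword arguments, and that retry fails with a second, unrelated TypeError. The sketch below reproduces this pattern with stand-in functions. None of these are the real anemoi or flash-attn implementations; they only mimic the signatures visible in the traceback.

```python
def flash_attn_func(query, key, value):
    """Stand-in for an older attention function that has no `softcap` parameter."""
    return query


def forward(x, *, fcstep):
    """Stand-in for a model forward that requires `fcstep` as a keyword-only argument."""
    return flash_attn_func(x, x, x, softcap=None)  # raises the underlying TypeError


def predict_step(x, **kwargs):
    """Stand-in for the model interface, which forwards kwargs to the model."""
    return forward(x, **kwargs)


def runner_predict_step(x, **kwargs):
    """Mimics the try/except fallback shown in the traceback."""
    try:
        return predict_step(x, **kwargs)
    except TypeError:
        # Retry without kwargs: `fcstep` is now missing, so a second TypeError
        # is raised while the first one is still being handled.
        return predict_step(x)


try:
    runner_predict_step(1.0, fcstep=0)
except TypeError as err:
    print("reported:", err)                # complains about the missing `fcstep`
    print("root cause:", err.__context__)  # complains about `softcap`
```

When reading a chained traceback like this one, the exception printed before "During handling of the above exception, another exception occurred" is usually the one to fix.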
From further investigation, the validation failure seems to be caused by the way Databricks manages Python environments. For reference, I get the same error on environment validation in the AIFS-Single environment, which then runs the model successfully. Environment validation through the command line tool seems to work, though.
%sh anemoi-inference validate /Volumes/cdp_dev_sandbox_catalog_01/weather_enriched/enriched/aifs-ens-crps-1.0.ckpt
2025-10-14 11:19:26 WARNING Environment validation failed. The following issues were found:
python:
Python version mismatch: 3.11.6 != 3.11.11
mismatch:
Version of module anemoi.utils was lower in training than in inference: 0.4.22 <= 0.4.37
Do you know what I should do about the flash_attn wrapper problem?
I spent a bit of time yesterday trying to understand what happens inside runner.run. Building the input tensor and manually calling the model predict step method with fcstep as a kwarg produced the following error:
TypeError: flash_attn_func() got an unexpected keyword argument 'softcap'
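To confirm whether the installed flash_attn_func accepts the softcap keyword, its signature can be inspected directly. This is a minimal sketch: accepts_keyword is a hypothetical helper written for this example, and the import is guarded so the snippet still runs where flash-attn is not installed.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params and params[name].kind is not inspect.Parameter.POSITIONAL_ONLY:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


try:
    from flash_attn import flash_attn_func  # only present where flash-attn is installed
except ImportError:
    flash_attn_func = None

if flash_attn_func is not None:
    print("softcap supported:", accepts_keyword(flash_attn_func, "softcap"))
else:
    print("flash-attn is not importable in this environment")
```

If this prints that softcap is not supported, the installed flash-attn does not match what the anemoi-models attention wrapper expects.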
Hello again. Do you have any further information about how to get around the fcstep issue?
I'm really not sure. We have only seen this issue with incorrect environments; it should not happen if everything is installed correctly.
Which flash_attn version do you have?
The version I've been having problems with is flash-attn 2.5.9.post1
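A version like this can be compared against the first release that supports softcap. The sketch below assumes that release is 2.6.0, which is my recollection of the flash-attn changelog rather than something confirmed in this thread, so verify it before relying on the result. release_tuple is a hypothetical helper for this example.

```python
import re

# Assumption: first flash-attn release with `softcap`; verify against the changelog.
ASSUMED_MIN_SOFTCAP_VERSION = (2, 6, 0)


def release_tuple(version_string):
    """Parse the leading 'X.Y.Z' of a version string, ignoring suffixes such as '.post1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return tuple(int(part) for part in match.groups())


installed = "2.5.9.post1"  # the version reported in this thread
print("new enough for softcap:", release_tuple(installed) >= ASSUMED_MIN_SOFTCAP_VERSION)
```

Under that assumption, 2.5.9.post1 predates softcap support, which would be consistent with the "unexpected keyword argument 'softcap'" error above.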