Error running the example code
I am trying to run the example code in a multi-gpu setting but it's failing :(
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
model = T5ForConditionalGeneration.from_pretrained("google/flan-ul2", device_map="auto", load_in_8bit=True)                                                                 
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
input_string = "Answer the following question by reasoning step by step. The cafeteria had 23 apples. If they used 20 for lunch, and bought 6 more, how many apples do they have?"
inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(inputs, max_length=200)
print(tokenizer.decode(outputs[0]))
Output:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_mm)
Hmm, very weird. Can you try using the latest versions of transformers & accelerate?
pip install --upgrade accelerate
pip install --upgrade git+https://github.com/huggingface/transformers.git@main
@will33am
 You need to play around with device_map since UL2 includes T5 blocks with residual connections, which causes an error when the blocks are split across multiple GPUs (Ref: https://github.com/huggingface/blog/blob/main/accelerate-large-models.md).
Use no_split_module_classes=["T5Block"] and also map the lm_head to the same device as the embedding layer. 
Here's an example script that works on my env with 3 low memory (16GB) GPUs.
https://github.com/akkikiki/huggingface_examples/blob/main/examples/load_flan_ul2.py
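For reference, here is roughly what that script does, as a minimal sketch rather than a drop-in copy: the max_memory values are placeholders for 16GB cards, and the device_map key names can differ between Accelerate/Transformers versions, so print device_map to check them on your setup.

from transformers import AutoConfig, T5ForConditionalGeneration
from accelerate import init_empty_weights, infer_auto_device_map

config = AutoConfig.from_pretrained("google/flan-ul2")
with init_empty_weights():
    empty_model = T5ForConditionalGeneration(config)

# Cap GPU 0 below its real capacity so the lm_head (moved there by hand below)
# and the input tensors still fit.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "10GiB", 1: "15GiB", 2: "15GiB"},
    no_split_module_classes=["T5Block"],  # never split a residual T5 block across GPUs
)
# Keep the lm_head on the same device as the decoder embeddings.
device_map["lm_head"] = device_map["decoder.embed_tokens"]  # key name may differ; print(device_map)

model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-ul2", device_map=device_map, load_in_8bit=True
)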
@ybelkada
	  This is the same issue as what we encountered the other day when you were working on fixing multi-gpu settings for BLIP-2 :)
https://github.com/huggingface/transformers/pull/21707
EDIT: Fixed some grammatical mistakes.
@akkikiki
	
When I try your script load_flan_ul2.py on a single 16GB GPU I get this error:
ValueError: If you want to offload some keys to cpu or disk, you need to set load_in_8bit_fp32_cpu_offload=True. Note that these modules will not be  converted to 8-bit but kept in 32-bit.
@SamuelAzran Yeah, that script assumes you have four 16GB GPUs, so you'll need to offload to CPU when you have only one.
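If you do want to try it on a single GPU, recent transformers versions expose an 8-bit CPU-offload switch through BitsAndBytesConfig. Treat the sketch below as an assumption-laden example: the kwarg has changed names across versions, the memory budgets are placeholders, and the offloaded modules stay in fp32 on CPU, so generation will be slow.

from transformers import T5ForConditionalGeneration, BitsAndBytesConfig

# llm_int8_enable_fp32_cpu_offload lets layers that don't fit on the GPU live on CPU in fp32
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_enable_fp32_cpu_offload=True)
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-ul2",
    device_map="auto",
    max_memory={0: "14GiB", "cpu": "60GiB"},  # placeholder budgets: one 16GB GPU plus CPU RAM
    quantization_config=quantization_config,
)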
@diegomontoya The reason for setting max_memory[0]="10GiB" is the ad-hoc move of lm_head to GPU 0 (plus loading the input tensors onto GPU 0 before running the forward pass). Without that move, you hit the same RuntimeError: Expected all tensors to be on the same device when you run model.generate.
You can play around with this max memory value (it does not have to be 10GiB, and there may be smarter ways of doing this), but without the headroom Accelerate does not account for the lm_head being moved there afterwards, and you get an OOM on GPU 0.
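If you don't want to hard-code cuda:0 for the inputs, you can also look up where things actually landed after loading. This is a small sketch using hf_device_map, the map transformers stores on the model when a device_map is used; the key names depend on how your model was split.

# Send the inputs to whichever device the lm_head ended up on instead of assuming GPU 0.
lm_head_device = model.hf_device_map.get("lm_head", 0)
inputs = tokenizer(input_string, return_tensors="pt").input_ids.to(lm_head_device)
outputs = model.generate(inputs, max_length=200)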
@akkikiki Thanks for sharing an example script to run flan-ul2 on multiple GPUs. I've tried it on an instance with 4 V100 GPUs (each with 16GB of memory). It didn't throw any error, but the output didn't look correct to me either.
I got the following output when I ran your script (without changing anything except the file name):
python flan_ul2_runbook.py
/opt/conda/envs/flanul2/lib/python3.8/site-packages/bitsandbytes/cuda_setup/paths.py:98: UserWarning: /opt/conda/envs/flanul2 did not contain libcudart.so as expected! Searching further paths...
  warn(
CUDA SETUP: CUDA path found: /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
CUDA_SETUP: Detected CUDA version 116
CUDA_SETUP: Loading binary /opt/conda/envs/flanul2/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda116_nocublaslt.so...
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
In case it helps, here is my setup:
>>> torch.__version__
'1.13.1+cu116'
>>> transformers.__version__
'4.26.1'
>>> accelerate.__version__
'0.17.1'
Do you have any idea what the problem could be?
@cyt78
	 Have a look at https://github.com/huggingface/transformers/issues/21987 :)
TL;DR: Play around with N in BitsAndBytesConfig(llm_int8_threshold=N)
@cyt78
 N=5 worked for me. Basically it's a trade-off between memory usage and accuracy (on V100, which does not have int8 support at the hardware level; I believe it's a different story for A100 and others).
If you are talking about https://github.com/akkikiki/huggingface_examples/blob/main/examples/load_flan_ul2.py#L17, then it does load in int8 with load_in_8bit=True
@akkikiki My account was created today and therefore I can't post any more comments today, so I'll reply with this new account :).
Yes, I was talking about the script you pointed out, and I realised that it does indeed use int8. My bad! I've tried a couple of different N values ranging from 1.0 to 10.0, including 5.0, and I got a CUDA out of memory error every single time. I found this interesting since you mentioned that you managed to run it on 3 GPUs with 16GB of memory each, while I'm trying to run the exact same script on 4 GPUs with 16GB each. Can you think of any possible reason that might lead to the out of memory error in my case?
Here are the changes I made to the script to integrate your suggestion:
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(llm_int8_threshold=5.0)
model = T5ForConditionalGeneration.from_pretrained(model_id, device_map=device_map, load_in_8bit=True, quantization_config=quantization_config)
@cyt79
 Yeah, it looks like there's more memory usage when the quantization threshold is lower (not 100% sure why), so the 3-GPU example is just without setting BitsAndBytesConfig(llm_int8_threshold=5.0). Best to play around with lowering max_memory to avoid it (and with CPU offloading if needed).
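For example, something along these lines (the budgets are placeholders, and offloading 8-bit weights to CPU also needs the fp32 CPU-offload flag mentioned earlier in the thread):

max_memory = {0: "10GiB", 1: "14GiB", 2: "14GiB", 3: "14GiB", "cpu": "30GiB"}
device_map = infer_auto_device_map(
    empty_model,  # the meta model built with init_empty_weights, as in the earlier sketch
    max_memory=max_memory,
    no_split_module_classes=["T5Block"],
)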
@akkikiki
 Thanks for sharing the example code and all the explanations! I would like to ask a related question about specifying the no_split_module_classes parameter, since there's no official documentation for infer_auto_device_map. I looked into the source code, and it seems that it gets the components of the model by checking model.named_parameters(). However, I didn't find anything with the name T5Block among the parameters of flan_ul2. I wonder if it's defined elsewhere, and how you found a way to specify this parameter properly?
I also tried to run the original T5 model by simply replacing flan_ul2 with t5-large, but it triggered the Expected all tensors to be on the same device error too, and this time specifying no_split_module_classes=["T5Block"] didn't help, and neither did moving lm_head to gpu:0. Does that mean that each time we want to run a model we have to check the source code, look for the residual connections, and preserve them by specifying no_split_module_classes? Thanks!
EDIT: There is no problem doing inference with t5-large by swapping  flan_ul2 with t5-large. The multiple device error was caused by the fine-tuning part of my code.
@YzyLmc
 I believe you should share your t5-large script to give more context, since I did not have any trouble swapping flan_ul2 with t5-large. As long as it's in the same T5 family, it's not an issue with no_split_module_classes and the cause is something different.
How I found out is basically "connecting the dots": the Accelerate documentation (https://github.com/huggingface/blog/blob/main/accelerate-large-models.md) on OPTDecoderLayer; the BLOOM blog, especially the naive pipeline parallelism section (https://huggingface.co/blog/bloom-megatron-deepspeed#pipeline-parallelism), for the basic assumption that a layer should not be dispatched across multiple GPUs; my experience with the fix for loading BLIP-2 Flan-T5-XL on multiple GPUs (https://github.com/huggingface/transformers/pull/21707, hinted at by the warning @ybelkada raised in his PR); and getting my hands dirty by actually debugging and printing out the named_parameters.
I believe that with other types of multi-GPU parallelism (e.g., Tensor Parallelism, or TP), where we do not have to assume that a given layer (or, specifically, the weight tensor associated with that layer) sits on a single GPU (I guess; I'm not an expert on TP, so somebody correct me if I'm wrong), we probably don't have to care much about no_split_module_classes, but somebody will have to teach us how in a simple way :)
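In case it helps, here is a quick way to see the candidate class names without reading the model source. This is just a sketch: the _no_split_modules attribute is what device_map="auto" consults internally in recent transformers versions, so it may not exist on older ones.

from accelerate import init_empty_weights
from transformers import AutoConfig, T5ForConditionalGeneration

config = AutoConfig.from_pretrained("google/flan-ul2")
with init_empty_weights():
    model = T5ForConditionalGeneration(config)

# T5Block is a module class, so it shows up in named_modules(), not in named_parameters()
print(sorted({type(m).__name__ for _, m in model.named_modules()}))

# Recent transformers versions also record the recommended value on the model itself
print(model._no_split_modules)  # ["T5Block"]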
@akkikiki
 Thanks for your quick response! I just ran more tests, and you are totally correct that no_split_module_classes wasn't the issue in my case. I was trying to fine-tune the model, and it was the fine-tuning part that caused this error, which is a separate issue; inference worked perfectly when swapping flan_ul2 with t5-large. I'll edit my earlier post. Sorry about that!
Also, thank you for sharing your experience and insights. I can imagine how much effort you put into making this work, and I hope the Hugging Face folks will write clear documentation on this to make it less burdensome.
@akkikiki Many thanks for your reply! I've moved to a bigger instance with 8 V100 GPUs (each with 32GB of memory). Here I could run the official example code from the model card in bfloat16 and got the expected result. Then I tried to run your script again, and this time I got a different error:
Traceback (most recent call last):
  File "flan_runbook.py", line 18, in <module>
    device_map['lm_head'] = device_map["decoder.embed_tokens"]
KeyError: 'decoder.embed_tokens'
I guess "decoder.embed_tokens" has been renamed since you wrote this script. Do you know where I can check the current name for it?
@akkikiki When possible, can you give me some pointers on how to fix the above error? Many thanks in advance!
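One way to check this on your own setup is to print the computed map. This is a sketch; the keys you get depend on how finely Accelerate had to split the model, so on 32GB cards the decoder embeddings may be covered by a coarser entry such as "decoder" or the shared embedding "shared".

from accelerate import init_empty_weights, infer_auto_device_map
from transformers import AutoConfig, T5ForConditionalGeneration

config = AutoConfig.from_pretrained("google/flan-ul2")
with init_empty_weights():
    empty_model = T5ForConditionalGeneration(config)

device_map = infer_auto_device_map(empty_model, no_split_module_classes=["T5Block"])
# Look for the entry that covers the decoder embeddings and use that key instead
for name, device in device_map.items():
    print(name, device)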
@akkikiki I ran inference on the blip2-flan-t5-xxl model in a two-3090 environment. Following https://github.com/huggingface/transformers/pull/21707, I used
configuration = Blip2Config.from_pretrained("Salesforce/blip2-flan-t5-xxl")
with init_empty_weights():
    model = Blip2ForConditionalGeneration(configuration)
    device_map = infer_auto_device_map(model, no_split_module_classes=["T5Block"], max_memory={0: "24GiB", 1: "24GiB"})
device_map['language_model.lm_head'] = device_map["language_model.decoder.embed_tokens"]  # to make the generated tokens and input_ids land on the same device
model = Blip2ForConditionalGeneration(configuration).from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map=device_map, cache_dir="/mnt/14T-disk/code/HF_model/hub")
and device_map is:
{'query_tokens': 0, 'vision_model': 0, 'qformer': 0, 'language_projection': 0, 'language_model.shared': 0, 'language_model.decoder.embed_tokens': 0, 'language_model.encoder': 0, 'language_model.decoder.block.0': 0, 'language_model.decoder.block.1': 1, 'language_model.decoder.block.2': 1, 'language_model.decoder.block.3': 1, 'language_model.decoder.block.4': 1, 'language_model.decoder.block.5': 1, 'language_model.decoder.block.6': 1, 'language_model.decoder.block.7': 1, 'language_model.decoder.block.8': 1, 'language_model.decoder.block.9': 1, 'language_model.decoder.block.10': 1, 'language_model.decoder.block.11': 1, 'language_model.decoder.block.12': 1, 'language_model.decoder.block.13': 1, 'language_model.decoder.block.14': 1, 'language_model.decoder.block.15': 1, 'language_model.decoder.block.16': 1, 'language_model.decoder.block.17': 1, 'language_model.decoder.block.18': 1, 'language_model.decoder.block.19': 1, 'language_model.decoder.block.20': 1, 'language_model.decoder.block.21': 1, 'language_model.decoder.block.22': 1, 'language_model.decoder.block.23': 1, 'language_model.decoder.final_layer_norm': 1, 'language_model.decoder.dropout': 1, 'language_model.lm_head': 0}
 However, when I executed the inference, I received the following error message but still got an inference result from the model. I don't understand why this happened. Is the result reliable in this case? Thank you.
--- Logging error ---
Traceback (most recent call last):
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 678, in format
    record.message = record.getMessage()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/logging/__init__.py", line 368, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 711, in start
    self.io_loop.start()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 215, in start
    self.asyncio_loop.run_forever()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/asyncio/base_events.py", line 1906, in _run_once
    handle._run()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 510, in dispatch_queue
    await self.process_one()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 499, in process_one
    await dispatch(*args)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 406, in dispatch_shell
    await result
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 729, in execute_request
    reply_content = await reply_content
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 411, in do_execute
    res = shell.run_cell(
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 531, in run_cell
    return super().run_cell(*args, **kwargs)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 2961, in run_cell
    result = self._run_cell(
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3016, in _run_cell
    result = runner(coro)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in pseudo_sync_runner
    coro.send(None)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3221, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3400, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "/tmp/ipykernel_1716702/309436947.py", line 11, in <module>
    out = model.generate(**inputs)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 1805, in generate
    self._preprocess_accelerate()
  File "/home/whl/anaconda3/envs/blip2/lib/python3.10/site-packages/transformers/models/blip_2/modeling_blip_2.py", line 1607, in _preprocess_accelerate
    logger.warning(
Message: 'The language_model is not in the hf_device_map dictionary and you are running your script in a multi-GPU environment. this may lead to unexpected behavior when using accelerate. Please pass a device_map that contains language_model to remove this warning. Please refer to https://github.com/huggingface/blog/blob/main/accelerate-large-models.md for'
Arguments: (' more details on creating a device_map for large models.',)
@WHL95 It's probably better to start a thread in the BLIP-2 model community rather than here :)
But I'm not sure what is happening there. Maybe an error related to log formatting? If you can share the full script, that would help.
@WHL95 did you get any solution for this? I am also getting the same error.
The following solution worked for me. The error was due to the language model layers being split across the available GPUs.
import torch
from transformers import (
    Blip2VisionConfig,
    Blip2QFormerConfig,
    OPTConfig,
    Blip2Config,
    Blip2ForConditionalGeneration,
    Blip2Processor,
)
from accelerate import init_empty_weights, infer_auto_device_map
from accelerate.utils import get_balanced_memory
model_id = "Salesforce/blip2-opt-6.7b"
config = Blip2Config.from_pretrained(model_id)
processor = Blip2Processor.from_pretrained(model_id)
# Build an empty (meta) model just to compute a balanced device_map
with init_empty_weights():
    model = Blip2ForConditionalGeneration(config)
    max_memory = get_balanced_memory(model, max_memory=None, no_split_module_classes=["OPTDecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"], dtype=torch.float16, low_zero=False)
    device_map = infer_auto_device_map(model, no_split_module_classes=["OPTDecoderLayer", "Attention", "MLP", "LayerNorm", "Linear"], dtype=torch.float16, max_memory=max_memory)
# Keep the lm_head on the same device as the input embeddings so generate() works
device_map['language_model.lm_head'] = device_map['language_model.model.decoder.embed_tokens']
model = Blip2ForConditionalGeneration.from_pretrained(model_id, device_map=device_map, torch_dtype=torch.float16)
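And a minimal generation call to sanity-check the loaded model; this is a hypothetical usage sketch, and the COCO URL is just an arbitrary test image.

import requests
from PIL import Image

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
# Cast the pixel values to fp16 and move them to GPU 0, where the vision tower sits per the device_map above
inputs = processor(images=image, return_tensors="pt").to(0, torch.float16)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())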