ValueError with multiple A100 GPUs

#28
by saireddy - opened

Is anyone else facing this issue with multiple A100 GPUs?
ValueError: You can't train a model that has been loaded in 8-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device() or device_map={'':torch.xpu.current_device()}
I am using "auto" for device map, still hitting this issue

Google org

Hi @saireddy, could you please take a look at this similar issue? It seems to be a duplicate. Please let us know if the issue still persists. Thank you.

saireddy changed discussion status to closed

Hi @saireddy, you might find Impulse AI (https://www.impulselabs.ai/) useful. We make it easy to fine-tune and deploy open-source models. I know this isn't relevant to your problem above, but it might be simpler to fine-tune and deploy with us. Hopefully you find it helpful!

docs: https://docs.impulselabs.ai/introduction
python sdk: https://pypi.org/project/impulse-api-sdk-python/
