[2025-05-03 14:16:02] Created output directory: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
[2025-05-03 14:16:02] Chat mode disabled
[2025-05-03 14:16:02] Model size is over 3B (7B). Using LoRA training.
[2025-05-03 14:16:02] Adjusted learning rate for LoRA: 2e-4
[2025-05-03 14:16:02] No QA format data will be used
[2025-05-03 14:16:02] Limiting dataset size to: 1000 samples
[2025-05-03 14:16:02] =======================================
[2025-05-03 14:16:02] Starting training for model: /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B
[2025-05-03 14:16:02] =======================================
[2025-05-03 14:16:02] CUDA_VISIBLE_DEVICES: 6
[2025-05-03 14:16:02] WANDB_PROJECT: wikidyk-ar
[2025-05-03 14:16:02] DATA_PATH: /data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json
[2025-05-03 14:16:02] Global Batch Size: 128
[2025-05-03 14:16:02] Data Size: 1000
[2025-05-03 14:16:02] Executing command: torchrun --nproc_per_node "1" --master-port 29501 src/train.py \
    --model_name_or_path "/data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B" \
    --data_path "/data/yuwei/WikiDYK/data/wikidyk2022-2025_01082025_gpt-4o_evalv2_pages_formatted_combined_v2.json" \
    --output_dir "train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask" \
    --num_upsample "1000" --per_device_train_batch_size "64" --gradient_accumulation_steps "2" \
    --learning_rate "2e-4" --num_train_epochs "1" --model_max_length "4096" \
    --report_to wandb --logging_steps 50 \
    --save_strategy steps --save_steps 10000 --save_total_limit 3 \
    --resume_from_checkpoint True --bf16 True --use_flash_attention_2 True \
    --qa_data_ratio "-1" --predict_mask "true" \
    --use_lora --lora_r 32 --lora_alpha 16 --ds_size 1000
[2025-05-03 14:16:02] Training started at Sat May 3 02:16:02 PM PDT 2025
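A quick sanity check on the configuration above: the global batch size of 128 is per_device_train_batch_size 64 × gradient_accumulation_steps 2 × 1 process (--nproc_per_node 1), and the 1000-sample dataset upsampled 1000× yields 1,000,000 training examples, so the single epoch would need ceil(1,000,000 / 128) = 7,813 optimizer steps.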
/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
2025-05-03 14:16:12.147068: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746306972.171366 1061841 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746306972.184110 1061841 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1746306972.213360 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746306972.213500 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746306972.213503 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1746306972.213507 1061841 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-05-03 14:16:12.221176: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
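The cuFFT/cuDNN/cuBLAS "Unable to register factory" errors and the repeated "computation placer already registered" warnings above are import-time noise that commonly appears when a TensorFlow/XLA installation shares the environment with PyTorch; they are almost certainly unrelated to the failure later in this log.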
WARNING:root:Output directory: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
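The two warnings above name their own fixes. A minimal sketch of the non-deprecated loading path, assuming the model path and bf16 setting from the command line (all other arguments omitted):

    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "/data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B",
        torch_dtype=torch.bfloat16,               # matches --bf16 True
        attn_implementation="flash_attention_2",  # replaces use_flash_attention_2=True
    )
    model.to("cuda")  # per the second warning: move to GPU after CPU initialization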
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4/4 [00:00<00:00, 50.38it/s]
trainable params: 10,092,544 || all params: 7,625,709,056 || trainable%: 0.1323
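The trainable-parameter count is exactly what rank-32 LoRA adapters on the q_proj and v_proj attention projections of Qwen2.5-7B (28 layers, hidden size 3584, KV dim 512) would give: 28 × 32 × ((3584 + 3584) + (3584 + 512)) = 10,092,544. A minimal peft sketch matching the command-line flags; target_modules is inferred from that count, not stated in the log:

    from peft import LoraConfig, get_peft_model

    lora_config = LoraConfig(
        r=32,                                 # --lora_r 32
        lora_alpha=16,                        # --lora_alpha 16
        target_modules=["q_proj", "v_proj"],  # inferred from the parameter count, an assumption
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # prints the "trainable params" line above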
WARNING:root:Loading data...
WARNING:root:Dataset initialized with all QA data:
WARNING:root: - 0 QA examples
WARNING:root: - 1000 fact examples with upsampling factor 1000
WARNING:root: - Total examples: 1000000
/data/yuwei/WikiDYKEvalV2/src/train.py:115: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.
trainer = Trainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
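The FutureWarning names its own fix: newer transformers versions take the tokenizer as processing_class. A one-line sketch of the updated call in train.py, keeping the surrounding variables from the deprecated line:

    trainer = Trainer(model=model, processing_class=tokenizer, args=training_args, **data_module)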
E0503 14:16:33.603000 1061749 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -9) local_rank: 0 (pid: 1061841) of binary: /data/yuwei/miniconda3/envs/wikidyk/bin/python
Traceback (most recent call last):
  File "/data/yuwei/miniconda3/envs/wikidyk/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/yuwei/miniconda3/envs/wikidyk/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
src/train.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-05-03_14:16:33
host : sn4622116169
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 1061841)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1061841
========================================================
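Exit code -9 means the rank-0 worker was killed from outside the process with SIGKILL, which on a shared training host is most often the kernel OOM killer reclaiming CPU memory. That reading fits the timeline here: the 7B model was initialized on CPU (per the Flash Attention warning above) and 1,000,000 upsampled examples had just been materialized when the process died during Trainer construction. Inspecting the kernel log on the host, for example with dmesg | grep -i "out of memory", would confirm or rule this out.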
[2025-05-03 14:16:34] ERROR: Training failed for /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B with exit code 1
[2025-05-03 14:16:34] Check error log for details: train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask/20250503_002630.log
[2025-05-03 14:16:34] Resource usage after training /data/yuwei/WikiDYK/downloaded_models/Qwen2.5-7B:
[2025-05-03 14:16:34] GPU memory usage:
80106 MiB, 81920 MiB
80234 MiB, 81920 MiB
71162 MiB, 81920 MiB
72620 MiB, 81920 MiB
63496 MiB, 81920 MiB
66876 MiB, 81920 MiB
2956 MiB, 81920 MiB
70994 MiB, 81920 MiB
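The two-column rows above read as used vs. total memory per GPU, the shape produced by a query like nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader (an assumption about the wrapper script, which is not shown). Seven of the eight devices are nearly full with other workloads; the seventh row, device 6, the one this job was pinned to via CUDA_VISIBLE_DEVICES, shows only 2956 MiB used, consistent with the killed process having already released its memory.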
[2025-05-03 14:16:36] Disk space usage for model outputs:
12K train_results_pred_mask/_data_yuwei_WikiDYK_downloaded_models_Qwen2.5-7B_ds1000_upsample1000_predict_mask
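A 12K footprint confirms that nothing was checkpointed: the output directory holds only small metadata, consistent with the run dying during Trainer initialization, well before the first save at --save_steps 10000.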
[2025-05-03 14:16:36]